Preface



This volume contains all the papers presented at the First International Confer- 
ence on Discovery Science (DS’98), held at the Hotel Uminonakamichi, Fukuoka, 
Japan, during December 14-16, 1998. 

This conference was organized as part of the activities of the Discovery Science
Project sponsored by a Grant-in-Aid for Scientific Research on Priority Area from
the Ministry of Education, Science, Sports and Culture (MESSC) of Japan.
This is a three-year project starting from 1998 that aims to (1) develop new 
methods for knowledge discovery, (2) install network environments for knowledge 
discovery, and (3) establish Discovery Science as a new area of computer science. 

The DS’98 program committee selected 28 papers and 34 posters/demos from
76 submissions. Additionally, the conference featured five invited talks, by Pat
Langley of the Institute for the Study of Learning & Expertise, Heikki Mannila of
the University of Helsinki, Shinichi Morishita of the University of Tokyo, Stephen
Muggleton of the University of York, and Keiichi Noe of Tohoku University.

The DS series provides an open forum for intensive discussions and interchange
of new information, be it academic or business, among researchers working in
the new area of Discovery Science. It focuses on all areas related to discovery
including, but not limited to, the following areas: logic for/of knowledge discovery,
knowledge discovery by inferences, knowledge discovery by learning algorithms, 
knowledge discovery by heuristic search, scientific discovery, knowledge discov- 
ery in databases, data mining, knowledge discovery in network environments, in- 
ductive logic programming, abductive reasoning, machine learning, constructive 
programming as discovery, intelligent network agents, knowledge discovery from 
unstructured and multimedia data, statistical methods for knowledge discovery, 
data and knowledge visualization, knowledge discovery and human interaction, 
and human factors in knowledge discovery. 

Continuation of the DS series is supervised by its Steering Committee con- 
sisting of Setsuo Arikawa (Chair, Kyushu Univ.), Yasumasa Kanada (Univ. 
of Tokyo), Akira Maruoka (Tohoku Univ.), Satoru Miyano (Univ. of Tokyo), 
Masahiko Sato (Kyoto Univ.), and Taisuke Sato (Tokyo Inst. of Tech.).

DS’98 was chaired by Setsuo Arikawa (Kyushu Univ.) and assisted by the Local
Arrangement Committee: Takeshi Shinohara (Chair, Kyushu Inst. of Tech.),
Hiroki Arimura (Kyushu Univ.), and Ayumi Shinohara (Kyushu Univ.). It was
held in cooperation with the SIG of Data Mining, Japan Society for Software
Science and Technology.

We would like to express our immense gratitude to all the members of the 
Program Committee, which consists of:

Hiroshi Motoda (Chair, Osaka Univ., Japan)

Peter Flach (Univ. of Bristol, UK) 

Koichi Furukawa (Keio Univ., Japan) 

Randy Goebel (Univ. of Alberta, Canada) 

Ross King (Univ. of Wales, UK) 

Yves Kodratoff (Univ. de Paris-Sud, France)

Nada Lavrac (Jozef Stefan Inst., Slovenia) 

Wolfgang Maass (Tech. Univ. of Graz, Austria) 

Katharina Morik (Univ. of Dortmund, Germany) 

Shinichi Morishita (Univ. of Tokyo, Japan) 

Koichi Niijima (Kyushu Univ., Japan) 

Toyoaki Nishida (NAIST, Japan)

Hiroakira Ono (JAIST, Japan)

Claude Sammut (Univ. of NSW, Australia) 

Carl Smith (Univ. of Maryland, USA) 

Yuzuru Tanaka (Hokkaido Univ., Japan) 

Esko Ukkonen (Univ. of Helsinki, Finland) 

Raul Valdes-Perez (CMU, USA) 

Thomas Zeugmann (Kyushu Univ., Japan) 

The program committee members, the steering committee members, the local 
arrangement members, and subreferees enlisted by them all put a huge amount 
of work into reviewing the submissions and judging their importance and signif- 
icance, ensuring that the conference had high technical quality. We would like 
to express our sincere gratitude to the following subreferees. 



Sebastien Augier 
Toshiaki Ejima 
Toshinori Hayashi 
Daisuke Ikeda 
Juha Karkkainen 
Paul Munteanu 
Masayuki Numao 
Hiroshi Sakamoto 
Noriko Sugimoto 
Shusaku Tsumoto 
Takaichi Yoshida 



Peter Brockhausen 
Tapio Elomaa 
Koichi Hirata 
Thorsten Joachims 
Satoshi Matsumoto 
Teigo Nakamura 
Seishi Okamoto 
Michele Sebag 
Masayuki Takeda 
Tomoyuki Uchida 



Xiao-ping Du 
Russ Greiner 
Eiju Hirowatari 
Yukiyoshi Kameyama 
Tetsuhiro Miyahara 
Claire Nedellec 
Celine Rouveirol 
Shinichi Shimozono 
Izumi Takeuchi 
Takashi Washio 



We would also like to thank everyone who made this meeting possible: authors 
for submitting papers, the invited speakers for accepting our invitation, the 
steering committee, and the sponsors for their support. Special thanks are due 
to Takeshi Shinohara, the Local Arrangement Chair, for his untiring effort in 
organization, and to Ayumi Shinohara for his assistance with the preparation
of the proceedings.



Fukuoka, December 1998                                   Setsuo Arikawa
Osaka, December 1998                                     Hiroshi Motoda
Abstract. This paper is intended as an investigation of scientific discoveries
from a philosophical point of view. In the first half of this century,
most philosophers concentrated on the logical analysis of science
and set the problem of discovery aside. Logical positivists distinguished
a context of discovery from a context of justification and excluded the
former from their analysis of scientific theories. Although Popper criticized
their inductivism and suggested methodological falsificationism, he
also left the elucidation of discovery to psychological inquiries. The problem
of scientific discovery was properly treated for the first time by the
“New Philosophy of Science” school in the 1960’s. Hanson reevaluated
the method of “retroduction” on the basis of Peirce’s logical theory and
analysed Kepler’s astronomical discovery in detail as a typical application
of it. Kuhn’s theory of paradigm located discoveries in the context
of scientific revolutions. Moreover, he paid attention to the function of
metaphor in scientific discoveries. Metaphorical use of existing terms and
concepts to overcome theoretical difficulties often plays a decisive role in
developing new ideas. In the period of scientific revolution, theoretical
metaphors can open up new horizons of scientific research by way of
juxtapositions of heterogeneous concepts. To explicate such a complicated
situation we need the “rhetoric” of science rather than the “logic” of
science.



1 Introduction 

In the field of philosophy of science, scientific “discovery” has been a problematic
concept because it is too ambiguous to formulate in terms of the methodology of
science. If the procedure of discovery can be reduced to a certain logical algorithm,
it is not a philosophical but a mathematical problem. On the other hand, if it
is concerned with flashes of genius, it amounts to a kind of psychological problem.
As a result, philosophers of science in the twentieth century almost ignored the
problem of discovery and excluded it from their main concern.

However, in the 1960’s, the trend of philosophy of science changed radically.
At that time the so-called “New Philosophy of Science” school was
on the rise in opposition to Logical Positivism. Representative figures of this
school were Norwood Hanson, Thomas Kuhn, Stephen Toulmin and Paul
Feyerabend. They tackled, more or less, the problem of scientific discovery from
philosophical as well as sociological viewpoints. It is no exaggeration to say that
the concept of discovery was properly treated as a main subject of the philosophy
of science for the first time in this movement.

The purpose of this paper is to examine how the notion of “discovery” has
been discussed and analysed along the historical development of the philosophy
of science in this century. Furthermore, I would like to consider the role of
rhetorical moments, especially the explanatory function of metaphor, in
scientific discoveries. It is not my present concern to investigate the logical
structure of discovery, for many participants in this conference will probably
make that problem clear.

2 A Context of Justification vs. a Context of Discovery 

In the broad sense of the word, the origin of philosophy of science goes back
to Aristotle’s Analytica Posteriora. There he elucidated the structure of
scientific reasoning, e.g., deduction, induction and abduction. But his analysis mainly
dealt with demonstrative knowledge and did not step into the problem of
discovery in empirical sciences, for the process of discovery is contingent and
cannot be explicitly formulated. In the last chapter of Analytica Posteriora
Aristotle characterized the ability of discovery as “νοῦς”, i.e. “insight of reason”
or “comprehension.” [1] As is well known, in the medieval scholastic philosophy
this ability was called lumen naturale, which stood for natural light. Even in the
modern era, such tradition has dominated the main stream of the philosophy of
science, so that philosophers have considered discovery to be an intuitive
mental process irrelevant to their logical analysis.

One exception was C. S. Peirce, who tried to construct the logic of discovery,
i.e. how to find a hypothesis to solve a problem, on the basis of Aristotle’s concept
of abduction. This is what Peirce has to say on the matter: “All the ideas of
science come to it by the way of Abduction. Abduction consists in studying
facts and devising a theory to explain them. Its only justification is that if we
are ever to understand things at all, it must be in that way.” [14] He clarified the
unique functions of abduction in scientific research by contrasting it with the
methods of deduction and induction. Unfortunately his attempt was regarded as a
deviation from orthodoxy and almost forgotten for a long time. It was at the end of
the 1950’s that Hanson reevaluated Peirce’s theory of “abduction” or rather
“retroduction” in the context of contemporary philosophy of science.

In the narrow sense of the word, the philosophy of science began with a booklet
entitled The Scientific Conception of the World, which was a platform of the
Vienna Circle in 1929. This group was founded by Austrian philosopher Moritz
Schlick and consisted of philosophers, physicists and mathematicians. Their
outlook is generally described as logical positivism and symbolized by their slogans
“elimination of metaphysics” and “the unity of science.” To achieve their
purpose, the Vienna Circle developed the method of logical analysis that was
based on mathematical logic constructed by Frege and Russell at that time. Then
they adopted a “verifiability principle of meaning” as a criterion of demarcation,
which rejected metaphysical statements as cognitively meaningless. Generally
speaking, logical positivists regarded the task of philosophy as the logical
clarification of scientific concepts and aimed at “scientific philosophy” rather than
“philosophy of science.”

So far as the logical structure of scientific research is concerned, analytic 
approaches of the Vienna Circle achieved a brilliant success. However, they could 
not help excluding the problem of scientific discovery from their consideration so 
as to maintain the logical clarity of analysis. Take Hans Reichenbach’s argument 
for example. After pointing out that the role of inductive inference is not to find 
a theory but to justify it, he continues as follows:

The mystical interpretation of the hypothetico-deductive method as an 
irrational guessing springs from a confusion of context of discovery and 
context of justification. The act of discovery escapes logical analysis; 
there are no logical rules in terms of which a “discovery machine” could 
be constructed that would take over the creative function of the genius. 

But it is not the logician’s task to account for scientific discoveries; all 
he can do is to analyze the relation between given facts and a theory 
presented to him with the claim that it explains those facts. In other 
words, logic is concerned only with the context of justification. And the 
justification of a theory in terms of observational data is the subject of 
the theory of induction. [20] 

What is immediately apparent in this extract is that Reichenbach identifies
a discovery with a casual idea or a flash. The hypothetico-deductive method,
which is a kernel of methodology for modern science, is usually divided into
four steps: (1) observations, (2) postulating a general hypothesis, (3) deducing
a test statement from it, (4) testing the statement by experiments. In general,
the transition from the first step to the second step has been thought of as the
process of discovery that is closely connected to the procedure of induction. But
Reichenbach maintains that the central subject of the theory of induction is
not the second step but the fourth step, namely justification of theories. The
second step is merely an activity of scientific guess. Thus there are no logical
rules for promoting scientific discoveries. He leaves the explanation of the mental
mechanism of discovery to psychological or sociological studies. The strict
distinction between a context of discovery and a context of justification became
an indispensable thesis of logical positivism. After that, philosophers of science
devoted themselves to investigating the latter and almost ignored the former
problem.

3 Popper’s Falsificationism 

Karl Popper was one of the most influential philosophers of science in the
twentieth century. Though he associated with members of the Vienna Circle, he did
not commit himself to the movement of logical positivism. Rather, he criticized
their basic premise of the verifiability principle of meaning, and proposed the
concept of “falsifiability” instead. For the procedure of verification presupposes
the legitimacy of induction, which justifies a hypothesis on the basis of
experimental data. For that purpose, logical positivists were eager to construct a
system of inductive logic to calculate the probability of verification. They wanted
to give a firm basis to scientific theories by appealing to inductive logic.

However, for Popper, it is evident that inductive inferences are logically
invalid. Induction is simply characterized as a transition from particular
observations to a general law. Logically speaking, since this transition is nothing but a
leap from a conjunction of singular statements to a universal statement, it needs
to be justified by some other principle which makes it possible to bridge the gap,
e.g. “the principle of uniformity of nature” by J. S. Mill. Then, where can we
obtain such a principle? If it is not a metaphysical principle but an empirical
one, we have to gain it by means of induction again. Obviously this argument
is circular. Therefore it is impossible to justify inductive inference
in a proper sense, and we cannot reach any indubitable knowledge by
means of it.

Such difficulties were already pointed out by David Hume in the middle of the
eighteenth century. As is generally known, he headed straight for skepticism
concerning scientific knowledge through the distrust of inductive method, and
his skeptical conclusion was refuted by Kant’s transcendental argument.
Reexamining this controversy, Popper criticizes Hume for abandoning rationality. On
the other hand, he is not necessarily satisfied with Kant’s solution. According
to Kant’s apriorism, natural laws are not discovered but invented by human
intellect, as can be seen in the following quotation: “The understanding does
not derive its laws (a priori) from, but prescribes to, nature.” [3] Popper
basically accepts this argument, but rejects Kant’s thesis of the apriority of natural
laws, because for him every scientific statement about natural laws is no other
than a tentative hypothesis. As far as it belongs to empirical sciences, it must
be always exposed to the risk of falsification.

Consequently Popper suggests a negative solution to the foregoing “Hume’s
Paradox.” He asserts that a scientific hypothesis cannot be verified;
it can only be falsified. His argument depends upon the asymmetry between
verification and falsification. Even a vast collection of examples cannot
completely verify a universal statement. On the contrary, only one counter-example
is enough to falsify it. If induction is a procedure of verifying a hypothesis, it
is neither necessary for nor relevant to scientific research. The essence of
scientific inquiry consists in the process of strict falsification. We are merely able
to maintain a hypothesis as a valid theory until it is falsified in the future.
In this respect we have to accept the fallible character of knowledge. This is the
central point of Popper’s thesis of “falsificationism.”
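
The asymmetry between verification and falsification can be illustrated with a minimal sketch. Everything in it is an illustrative assumption rather than part of Popper's text: the hypothesis "all swans are white", the observation lists, and the function names `find_counterexample` and `status`. The point is only that a finite set of confirming instances never yields more than "not yet falsified", whereas a single counter-example suffices for "falsified".

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def find_counterexample(hypothesis: Callable[[T], bool],
                        observations: Iterable[T]) -> Optional[T]:
    """Return the first observation that contradicts the hypothesis, or None."""
    for obs in observations:
        if not hypothesis(obs):
            return obs
    return None


def status(hypothesis: Callable[[T], bool], observations: Iterable[T]) -> str:
    # A universal statement is never "verified" by finitely many observations:
    # the best possible outcome is "not yet falsified".
    if find_counterexample(hypothesis, observations) is None:
        return "not yet falsified"
    return "falsified"


# Hypothetical universal hypothesis: "all swans are white".
def all_swans_white(swan_colour: str) -> bool:
    return swan_colour == "white"


many_confirmations = ["white"] * 10_000          # a vast collection of examples
one_anomaly = many_confirmations + ["black"]     # plus a single counter-example

print(status(all_swans_white, many_confirmations))
print(status(all_swans_white, one_anomaly))
```

Note that the function deliberately has no "verified" outcome; this design choice mirrors the claim that a hypothesis is maintained only until it is falsified.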

From such a viewpoint, Popper characterizes the methodology of science
as a series of “conjecture and refutation” and reformulates the structure of
hypothetico-deductive method. Since induction is a white elephant, it would be
useless to start with observations. What is worse, observations are complicated
activities which should be described as “theory-impregnated.” There is no pure
observation without theoretical background. Therefore, to begin with, scientists
have to start from a problem ($P_1$) with which they are confronted. Then they
try to propose a tentative theory ($TT$) to settle the problem by means of
imaginative conjectures. Finally, they proceed to the procedure of error-elimination
($EE$) by a rigorous critical examination of the conjecture. If the tentative theory
is refuted, they have to grope for some other conjecture. If not, they will grapple
with a new problem ($P_2$). Popper summarizes the general schema of “the method
of conjecture and refutation” as follows [15]:

$P_1 \rightarrow TT \rightarrow EE \rightarrow P_2$.

In this schema, a step toward a scientific discovery corresponds to the process
of conjecturing a tentative theory to solve the problem. The word “conjecture”
sounds a little illogical or contingent. Then, is there any logic of scientific
discovery? Popper’s answer is negative. First of all, he points out that we should
distinguish sharply “between the process of conceiving a new idea, and the
methods and results of examining it logically” [16]. He goes on to say:

Some might object that it would be more to the purpose to regard it as 
the business of epistemology to produce what has been called a rational 
reconstruction of the steps that have led the scientists to a discovery — 
to the finding of some new truth. But the question is: what, precisely, do 
we want to reconstruct? If it is the process involved in the stimulation 
and release of an inspiration which are to be reconstructed, then I should 
refuse to take it as the task of the logic of knowledge. Such processes are 
the concern of empirical psychology but hardly of logic. [17]

What the passage makes clear at once is that Popper distinguishes the logic of
knowledge from the psychology of knowledge, and confines the task of philosophy
of science to the former. Thus, the problem of scientific discovery is thrown out
of it and left to the latter. There is no such thing as logical rules for discovery.
Moreover, he adds that “every discovery contains ‘an irrational element’, or ‘a
creative intuition’, in Bergson’s sense.” [18] In such a way of thinking, Popper
undoubtedly inherits the logical positivists’ distinction between a context of
justification and a context of discovery.

According to Popper’s falsificationism, the maxim of conjecture is, to borrow
Feyerabend’s phrase, “anything goes.” It is all right if we use mystical intuition
or oracle to conjecture a hypothesis. In conjecturing, the bolder the better. Such
a view underlies the following remarks by him: “And looking at the matter
from the psychological angle, I am inclined to think that scientific discovery is
impossible without faith in ideas which are of a purely speculative kind, and
sometimes even quite hazy; a faith which is completely unwarranted from the
point of view of science, and which, to that extent, is ‘metaphysical’.” [19] It is
for this reason that Popper regards the elucidation of discovery as the business of
the psychology of knowledge. In this respect, the title of his first work The Logic
of Scientific Discovery was very misleading, for it did not include any discussion
about the logic of discovery.

4 The Concept of Discovery in the New Philosophy of Science

From the late 1950’s to the early 1960’s, some philosophers of science took a
critical attitude toward the dominant views of logical positivism and developed
historical and sociological analyses of science. This new trend was later called
the school of “New Philosophy of Science.” In this stream of scientific thought,
they cast doubt on the distinction between a context of justification and a
context of discovery, and argued that both of them were closely interrelated and
inseparable. After that, the concept of discovery became a proper subject of the
philosophy of science.

Norwood Hanson’s book Patterns of Discovery laid the foundations of this
movement. In the “introduction” he clearly says that his main concern in this
book is not with the testing of hypotheses, but with their discovery. His
argument begins with a reexamination of the concept of observation. Needless to say,
Hanson is known as an advocate of the thesis of “the theory-ladenness of
observation.” It means that we cannot observe any scientific fact without making
any theoretical assumptions. Observations must reflect these assumptions. Thus
he concludes that “There is a sense, then, in which seeing is a ‘theory-laden’
undertaking. Observation of x is shaped by prior knowledge of x.” [4]

Observation is not a simple experience of accepting sense-data. It
presupposes a complicated theoretical background. Excellent scientists can observe
in familiar objects what no one else has observed before in the light of a new
theoretical hypothesis. In this sense, making observations and forming a hypothesis
are one and inseparable. We should rather suggest that an observational
discovery is usually preceded by a theoretical discovery. The discovery of the pi-meson
predicted by Hideki Yukawa’s physical theory was a notable example. He
intended to explain the phenomena of nuclear force and finally introduced a new
particle by analogy with the photon’s role in mediating electromagnetic
interactions. Hanson calls such a process of inference “retroduction” on the basis
of Peirce’s logical theory. Relevant to this point is his following remark:

Physical theories provide patterns within which data appear intelligible. 
They constitute a ‘conceptual Gestalt’. A theory is not pieced together 
from observed phenomena; it is rather what makes it possible to observe 
phenomena as being of a certain sort, and as related to other phenomena. 
Theories put phenomena into systems. They are built up ‘in reverse’ — 
retroductively. A theory is a cluster of conclusions in search of a premiss. 
From the observed properties of phenomena the physicist reasons his way 
towards a keystone idea from which the properties are explicable as a 
matter of course. The physicist seeks not a set of possible objects, but a 
set of possible explanations. [5]

Formally speaking, the retroductive inference is simply formulated as follows:

$P \rightarrow Q,\ Q \vdash P$.

It is evident that this is a logical mistake called the “fallacy of affirming the
consequent.” The procedure of hypothetico-deductive method provides no room for
such a fallacy to play a positive role. Therefore logical positivists excluded
retroduction from their logical analysis, and left it to the task of psychology. Hanson,
on the other hand, maintains that although retroduction is logically invalid, it
works as reasonable inference in scientists’ actual thinking, because retroduction
can bring about the change of a “conceptual Gestalt” or “conceptual
organization.” Such changes are nothing but a symptom of scientific discovery.
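
That the retroductive schema (from “if P then Q” and Q, infer P) is deductively invalid can be checked mechanically by enumerating truth values. The following is a minimal sketch under stated assumptions: the arrow is read as material implication, and the helper names `implies` and `is_valid` are hypothetical, introduced here only for illustration. Modus ponens is included for comparison as a valid form.

```python
from itertools import product
from typing import Callable, Sequence

Formula = Callable[[bool, bool], bool]


def implies(p: bool, q: bool) -> bool:
    """Material implication: P -> Q (an assumed reading of the arrow)."""
    return (not p) or q


def is_valid(premises: Sequence[Formula], conclusion: Formula) -> bool:
    """An argument form is valid if no truth assignment makes every premise
    true while the conclusion is false."""
    for p, q in product([True, False], repeat=2):
        if all(premise(p, q) for premise in premises) and not conclusion(p, q):
            return False
    return True


# Retroduction read as a deductive schema: P -> Q, Q, therefore P.
retroduction_valid = is_valid([implies, lambda p, q: q], lambda p, q: p)

# Modus ponens for comparison: P -> Q, P, therefore Q.
modus_ponens_valid = is_valid([implies, lambda p, q: p], lambda p, q: q)

print(retroduction_valid)   # False: P false, Q true satisfies both premises
print(modus_ponens_valid)   # True
```

The check concerns deductive validity only; it says nothing about whether retroduction is a reasonable inference in scientists' actual thinking, which is the point Hanson maintains.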

However, it is rather difficult to describe the “conceptual Gestalt” in detail
because of its vagueness. Hanson did no more than explain it figuratively by
appealing to examples of optical Gestalts. In the meantime, Thomas Kuhn
characterized its function as “paradigm” from a historical and sociological point of
view. His main work The Structure of Scientific Revolutions was published in 1962
and radically transformed the trend of philosophy of science in the twentieth
century. In this book Kuhn objected to logical positivists’ basic assumptions, to be
precise, the accumulation of scientific knowledge and the Whiggish
historiography of science. Instead of the continuous progress of science, he proposed a new
image of scientific development, namely the intermittent transformation of
scientific theories. To use Kuhn’s famous term, it should be called “paradigm-shift”
or “paradigm-change.”

The very concept of paradigm is, to echo him, “universally recognized
scientific achievements that for a time provide model problems and solutions to
a community of practitioners.” [7] The scientific research conducted under a
certain paradigm is called “normal science,” which is compared to puzzle-solving
activities following preestablished rules. Anomalies occur from time to time in
normal scientific inquiries. If there seems no way to cope with them, the present
paradigm of a discipline loses scientists’ confidence. Thus a crisis arises. In order
to overcome the crisis, scientists try to change their conceptual framework from
the old paradigm to the new one. This is what is called a scientific revolution. The
historical process of scientific development, according to Kuhn’s picture, can be
summarized in the following sequence:

paradigm → normal science → anomalies → crisis
→ scientific revolution → new paradigm → new normal science.

The scientific revolution involves a total reconstruction of materials and a
new way of looking at things. Kuhn considers it as a kind of “Gestalt switch” in 
scientific thinking. From a viewpoint of his historiography, we have to distinguish 
two kinds of scientific discovery, that is to say, discoveries in the stage of normal 
science and ones in the stage of scientific revolution. To borrow Kuhn’s phrase, 
it is a distinction between “discoveries, or novelties of fact” and “inventions, 
or novelties of theory.” [8] A simple example of the former is the discoveries 
of new chemical elements on the basis of the periodic table in the nineteenth 
century. By contrast, the discoveries of Einstein’s relativity theory and Planck’s 
quantum theory are typical examples of the latter. On this point the method 
of retroduction suggested by Hanson mainly concerns the invention of a new 
theory. But Kuhn himself thinks that there is no logical algorithm to create a
new paradigm and entrusts the problem of paradigm-shift to some psychological
and sociological considerations. He only suggests that: “almost always the men 
who achieve these fundamental inventions of a new paradigm have been either 
very young or very new to the field whose paradigm they change.” [9] 

5 The Function of Metaphor in Scientific Discoveries 

Although Kuhn takes a negative attitude to the logic of discovery, his theory 
of paradigm gives some hints to explicate the essence of scientific discovery. 
In the period of normal science most discoveries have to do with novelties
of fact and partial revision of theories. Therefore normal scientific research is 
a rather cumulative enterprise to extend the scope and precision of scientific 
knowledge. On the other hand, in the period of scientific revolution, discoveries 
are usually accompanied by a radical change of fact as well as theory. Both 
fact and assimilation to theory, observation and conceptualization, are closely 
interrelated in this kind of discovery. It consists of a complicated series of events. 
Kuhn makes several important statements on this process. 

Discovery commences with the awareness of anomaly, i.e., with the recog- 
nition that nature has somehow violated the paradigm-induced expecta- 
tions that govern normal science. It then continues with a more or less 
extended exploration of the area of anomaly. And it closes only when the 
paradigm theory has been adjusted so that the anomalous has become 
the expected. Assimilating a new sort of fact demands a more than additive
adjustment of theory and until that adjustment is completed —
until the scientist has learned to see nature in a different way — the new 
fact is not quite a scientific fact at all. [10] 

The phrase “the paradigm-induced expectation” is suggestive in this context,
because to become aware of anomalies scientists must commit themselves to a
particular paradigm. Curiously enough, scientific discovery leading to a new
paradigm-change presupposes a commitment to the former paradigm. Kuhn called such
a paradoxical situation “the essential tension” in scientific research. In other
words, the dialectic of tradition and novelty is working during scientific
revolutions. It is a process of trial and error, and takes time. One may say, as Kuhn in
fact does, that “only as experiment and tentative theory are together articulated
to a match does the discovery emerge and the theory become a paradigm.” [11]
As a rule, a conspicuous discovery is recognized as such after the scientific
revolution concerned. The title of “discovery” might be awarded only with hindsight
by the scientific community. It is for this reason that the multiple aspects of
discovery require not only epistemological but also sociological analyses.

There is one further characteristic of discovery that we must not ignore. It is
a metaphorical usage of scientific concepts and terms in inventing a new theory.
Scientists, as mentioned above, have to devote themselves to normal scientific
research in order to become aware of anomalies. Even when they deal with anomalies
from a revolutionary viewpoint, they cannot help using the existing concepts and terms.
In such a case it is very useful for them to employ metaphor and analogy.
Scientists involved in scientific revolution try to develop their new ideas by using old
terms metaphorically. One may say from what has been said that an important
symptom of theoretical discovery is an appearance of metaphors in the context
of scientific explanation.

Kuhn cites many examples which seem to support this point in his article
“What Are Scientific Revolutions?” [12] The conspicuous one is Planck’s
discovery of the quantum theory. At the end of the nineteenth century, he was
struggling with the problem of black-body radiation, which was one of the
representative anomalies in modern physics. The difficulty was how to explain the way
in which the spectrum distribution of a heated body changes with temperature.

Planck first settled the problem in 1900 by means of a classical method, 
namely the probability theory of entropy proposed by Ludwig Boltzmann. Planck 
supposed that a cavity filled with radiation contained a lot of “resonators.” This 
was obviously a metaphorical application of acoustic concept to the problem 
of thermodynamics. In the meantime, he noticed that a resonator could not 
have energy except a multiple of the energy “element.” As a result, to quote 
Kuhn’s expression, “the resonator has been transformed from a familiar sort 
of entity governed by standard classical laws to a strange creature the very 
existence of which is incompatible with traditional ways of doing physics.” [13] 
Later in 1909, Planck introduced the new concept “quantum” instead of the 
energy “element.” At the same time it turned out that the acoustic metaphor 
was not appropriate because of the discontinuous change of energy, and the 
entities called “resonators” now became “oscillators” in the expanded context.
Finally Planck was loaded with honors for discovering the quantum theory.

With regard to the functions of metaphor Max Black’s interaction theory is 
broadly accepted. To take a simple example, “Man is a wolf” is a well-known 
metaphor. One can understand this by recognizing a certain similarity between 
two terms. Its content is easily transformed into a form of simile “Man is like 
a wolf.” Another, less typical, example is “Time is a beggar.” In this case, it 
must be difficult to find some similarity between them. Therefore it cannot
have any corresponding simile. Here, although there is no similarity beforehand,
a new sort of similarity is formed by the juxtaposition of two terms. A poet often
employs this kind of metaphor to look at things from a different angle. According
to Black’s account, two heterogeneous concepts interact with each other in
metaphor, and a meaning of new dimensions emerges out of our imagination. 
He suggests that “It would be more illuminating in some of these cases to say 
that the metaphor creates the similarity than to say that it formulates some 
similarity antecedently existing.” [2] In short, the essence of metaphor consists
in creating new similarities and reorganizing our views of the world. 

One can see the creative function of metaphor not only in literary works but 
also in scientific discourses. But it must be noted that metaphorical expressions 
in scientific contexts are highly mediated by networks of scientific theories in 
comparison with literary or ordinary contexts, cf. [6]. For example, the following
statements cannot play the role of metaphor without some theoretical background:
“Sound is propagated by wave motion” and “Light consists of particles.” These
opened up new ways of articulating nature in cooperation with physical
theories. In this sense introducing metaphor and constructing a new theory are
interdependent and, as it were, two sides of the same coin. Of course Planck’s 
introduction of the acoustic concept of “resonators” into the black-body theory 
is another illustration of the same point. 

Even the statement “The earth is a planet” was a kind of metaphor before 
the Copernican Revolution. For, in the sixteenth century, people who were tied
to the Aristotelian world-view could not find any similarity between the earth and
a planet. The heliocentric theory of Copernicus made it possible to see a new
similarity between them. Needless to say, nowadays we do not regard that statement
as metaphor. We can understand it as a literal description without any
difficulties. Thus metaphorical expressions gradually switch over to trivial ones and
lose their initial impact. Matters are quite similar in the case of ordinary
language. Such a process overlaps with the transition from scientific revolution to
normal science in the Kuhnian picture of scientific development. It might be of no
use employing metaphors in normal science. On the contrary, in the midst of
scientific revolutions, making use of metaphors is an indispensable means to renew
our theoretical perspective. 



6 Conclusion 



It follows from what has been said that the metaphorical use of existing terms
and concepts to deal with anomalies plays an important role in the initial
stage of scientific discoveries. One can find a number of examples in the history
of science. Especially at times of scientific revolution, as Kuhn pointed out, there
are no explicit logical rules for inventing a new theory, because scientific revolutions
include drastic changes in the basic premises of our scientific knowledge. Logical
procedures are certainly useful for normal scientific research. By contrast, in the
period of paradigm shift, scientists cast doubt on the existing conceptual scheme
and grope for a new categorization of phenomena. At this point, metaphor fully shows
its ability to alter the criteria by which terms attach to nature. It opens up
a new vista for research through juxtapositions of different kinds of ideas.

However, formal logical analysis is not necessarily applicable to the investigation
of metaphor. In this sense, it was unavoidable for logical positivists to
exclude the context of discovery from their scientific philosophy. The important
role of model, metaphor, and analogy in scientific discoveries was recognized for
the first time in the discussions of the New Philosophy of Science. Traditionally the
study of figurative expressions has been called “rhetoric.” Logic and rhetoric are
not contradictory but complementary disciplines. Whereas logical structures of
scientific theories have been analyzed in detail, rhetorical aspects of scientific
thought have been set aside. It seems reasonable to conclude that if we are to
search for the many-sided structure of scientific discoveries, what we need is the
rhetoric of science rather than the logic of science.

Abstract. Exploratory data mining, machine learning, and statistical
modeling all have a role in discovery science. We describe a paleoecologi- 
cal reconstruction problem where Bayesian methods are useful and allow 
plausible inferences from the small and vague data sets available. 
Paleoecological reconstruction aims at estimating temperatures in the 
past. Knowledge about present day abundances of certain species is
combined with data about the same species in fossil assemblages (e.g.,
lake sediments). Stated formally, the reconstruction task has the form of a
typical machine learning problem. However, to obtain useful predictions, 
a lot of background knowledge about ecological variation is needed. 

In the paleoecological literature the statistical methods are involved variations
of regression. We compare these methods with regression trees,
nearest neighbor methods, and Bayesian hierarchical models. All the 
methods achieve about the same prediction accuracy on modern spec- 
imens, but the Bayesian methods and the involved regression methods 
seem to yield the best reconstructions. The advantage of the Bayesian 
methods is that they also give good estimates on the variability of the 
reconstructions.

Keywords: Bayesian modeling, regression trees, machine learning, 
statistics, paleoecology 



1 Introduction 

Exploratory data mining, machine learning, and statistical modeling all have a 
role in discovery science, but it is not at all clear what the applicability of the
different techniques is. In this paper we present a case study from paleoecological
reconstruction, discussing how the nature of the problem affects the selection
of the techniques. Our emphasis is on the methodology: we try to discuss the
motivation behind different choices of techniques. For this particular problem,



S. Arikawa and H. Motoda (Eds.): DS’98, LNAI 1532, pp. 12-24, 1998. 
(c) Springer-Verlag Berlin Heidelberg 1998



Learning, Mining, or Modeling? A Case Study from Paleoecology 






the application of Bayesian methods seems to be useful: background knowledge 
can be encoded in the model considered, and this allows making more plausible 
inferences than, e.g., by ‘blind’ machine learning methods. 

The application domain is paleoecological reconstruction, which aims at estimating
temperatures in the past. Such reconstructions are crucial, e.g., in trying
to find out what part of the global warming is due to natural variation and
how significant the human influence is. In paleoecological reconstruction, knowledge
about present day abundances of certain species is combined with data
about the same species in fossil assemblages (e.g., lake sediments). The present
day abundances and present day temperatures can be viewed as the training
set, and the abundances of the taxa in sediments are the test set, for which
the temperatures are not known. Stated formally, the reconstruction task thus 
has the form of a typical machine learning problem. However, to obtain useful 
predictions, a lot of background knowledge about ecological variation is needed. 

In the paleoecological literature the statistical methods are involved variations
of regression. We compare these methods with regression trees, nearest neighbor 
methods, and Bayesian hierarchical models. All the methods achieve about the 
same prediction accuracy on modern specimens, but the Bayesian methods and 
the involved regression methods seem to yield the best reconstructions. The 
advantage of the Bayesian methods is that they also give good estimates on the 
variability of the reconstructions. While exploratory pattern discovery methods 
do not have a direct use in the actual reconstruction task, they can be used to 
search for unexpected dependencies between species. 

Data The data we analyze in this paper consists of surface-sediment and limnological
data from 53 subarctic lakes in northern Fennoscandia [6,7]. The data
contains abundances of 36 taxa and measurements of 25 environmental variables
such as mean July temperature, alkalinity, or the size of the catchment area. The
fossil assemblage available consists of abundances of the 36 taxa at 147 different
depths in the sediment of one of the lakes. The deepest layer accumulated
approximately 10 000 years ago.

2 The Reconstruction Problem 

In this section we describe a simplified version of the paleoecological reconstruc- 
tion problem and introduce the notation needed for the description of the models. 
We mainly use the notational conventions of [1]; a summary of notations is given 
in Table 1. Suppose we have $n$ sites and $m$ taxa, and let $y_{ik}$ be the abundance
of taxon $k$ at site $i$. Let $Y = (y_{ik})$ be the $n \times m$ matrix of abundances.

Suppose that for the $n$ sites we have measured $p$ physico-chemical variables of
the environment; denote by $x_{ij}$ the value of variable $j$ on site $i$, and by $X = (x_{ij})$
the $n \times p$ matrix of these variables. The environment at site $i$ is thus described by
the vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$. The problem in quantitative reconstruction
is to obtain good estimates of the environmental variables in the past, such as
temperature, based on fossil assemblages of abundances.

Table 1. Notations.

Symbol                                    Meaning
----------------------------------------  ------------------------------------------------
$n$                                       number of sites
$m$                                       number of taxa
$p$                                       number of environmental variables
$x_{ij}$                                  value of environmental variable $j$ on site $i$
$X = (x_{ij})$                            matrix of values of environmental variables
$y_{ik}$                                  abundance of taxon $k$ at site $i$
$Y = (y_{ik})$                            matrix of abundances
$y_0 = (y_{01}, y_{02}, \ldots, y_{0m})$  the fossil assemblage for which the environmental
                                          variables are to be inferred
$x_0 = (x_{01}, x_{02}, \ldots, x_{0p})$  values of environmental variables for the fossil
                                          assemblage; to be inferred



Definition 1 (Quantitative reconstruction problem) Let $Y = (y_{ik})$ be the
matrix of abundances and $X = (x_{ij})$ the matrix of environmental variables.
Let $y_0 = (y_{01}, y_{02}, \ldots, y_{0m})$ be a fossil assemblage, and let
$x_0 = (x_{01}, x_{02}, \ldots, x_{0p})$ be the set of values of environmental variables
for the fossil assemblage. Given $Y$, $X$, and $y_0$, reconstruct the environmental
variables $x_0$.

In machine learning terms, the input consists of pairs $(y_i, x_i)$ of abundances
and the corresponding environment. The task is to learn to predict $x_i$ from $y_i$,
and to apply the learned function in reconstructing $x_0$ from $y_0$.

Problems and requirements The data is typically not consistent and does not display
obvious response functions for the organisms. Figure 1 shows the observed
abundances of a typical chironomid species as a function of lake temperature [6].
The large variation in abundances is due to the large number of both
measured and unmeasured ecological variables. 

There are two important requirements for any reconstruction method [1]. 
First, it is essential to obtain information about the reliability of the estimates 
of environmental variables. For instance, if the estimates are known to be rough, 
then definitive conclusions for or against global warming can hardly be drawn.
It seems obvious that the estimates cannot be exact for any method, due to the 
vagueness of data. This makes the accuracy estimation even more important. 

The other requirement is the capability to extrapolate from the training 
examples to unknown areas. The temperatures of the modern training set, for 
instance, range from 9 to 15°C with three exceptions (Figure 1), but it is not 
reasonable to assume that the temperatures to be reconstructed have been within 
this range. 

This second requirement leads to a more fundamental question about how the
phenomenon under consideration is described or modeled by the reconstruction
method. There are two major approaches: (1) estimate (e.g., by learning or
modeling) the modern response function $U$ of each species by $\hat{U}$ such that

$$y_i \approx \hat{U}(x_i)$$

and reconstruct an estimate $\hat{x}_0 = \hat{U}^{-1}(y_0)$, or (2) directly construct an estimate
$\hat{U}^{-1}$ such that

$$x_i \approx \hat{U}^{-1}(y_i).$$

Fig. 1. Abundance of a chironomid taxon (Heterotanytarsus sp.) in 53 subarctic
lakes.



The latter approach is typical for most machine learning techniques. However,
modelling $U^{-1}$ directly, e.g., by a regression tree, has the weakness that there
is no guarantee at all of the form of the corresponding response function $U$;
the response functions might be ecologically impossible. This means that even
if the model works well for the training set and shows good accuracy in cross 
validation tests, its performance in extrapolation in real reconstruction might be 
unpredictable and intolerable. 

The data set available for training is not large, only 53 lakes, but it has 
36 dimensions (taxa abundances). Whatever the method, learning appropriate 
generalizations necessary for extrapolation can obviously be difficult. In machine 
learning terms this can be seen as a question of using learning bias to expedite 
models with plausible extrapolation. 

In the following sections we consider these issues in the context of statisti- 
cal state-of-the-art methods for paleoecological reconstruction, regression tree 
learning and nearest neighbor prediction, and finally Bayesian modelling and 
inference.

3 Traditional Statistical Methods

In this section we briefly describe some properties of the methods that are typically
used for reconstructions. We refer to [1] for an in-depth survey.

The methods can be divided into three groups: linear regression and inverse 
linear regression, weighted averaging, and modern analogue techniques. Of these, 
in linear regression and inverse linear regression the abundances are modeled as 
linear functions of the environmental conditions, or the environmental variable is 
modeled by a linear combination of the abundances. The number of taxa can be 
decreased by using a principal components analysis, and variable transformations 
can be carried out to avoid problems with nonlinearities. 

In weighted averaging regression the concept of a unimodal response curve of
a taxon is present. The species’ optimum value for the environmental variable $x$
is assumed to be the weighted average of the $x$ values of the sites in which the
taxon occurs; as weight, the abundance of the taxon is used. That is, the estimate
is

$$\hat{u}_k = \frac{\sum_{i=1}^{n} y_{ik}\, x_i}{\sum_{i=1}^{n} y_{ik}}.$$

The estimate of the variance of the optimum, the tolerance, is

$$\hat{t}_k^{\,2} = \frac{\sum_{i=1}^{n} y_{ik}\,(x_i - \hat{u}_k)^2}{\sum_{i=1}^{n} y_{ik}}.$$
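
The two weighted-averaging estimates above can be sketched in a few lines of Python. This is only an illustration: the function names `wa_optimum` and `wa_tolerance_sq` and the toy data are assumptions made for the example, and the sketch covers neither the deshrinking step nor the partial least squares extension (WA-PLS).

```python
def wa_optimum(x, y_k):
    """Abundance-weighted mean of the environmental variable for one taxon."""
    total = sum(y_k)
    if total == 0:
        raise ValueError("taxon does not occur at any site")
    return sum(y * xi for xi, y in zip(x, y_k)) / total


def wa_tolerance_sq(x, y_k):
    """Abundance-weighted variance around the optimum (squared tolerance)."""
    u_k = wa_optimum(x, y_k)
    return sum(y * (xi - u_k) ** 2 for xi, y in zip(x, y_k)) / sum(y_k)


# Toy data (hypothetical): temperatures at 4 sites and counts of one taxon.
temps = [10.0, 12.0, 14.0, 16.0]
counts = [0, 3, 1, 0]
optimum = wa_optimum(temps, counts)            # (3*12 + 1*14) / 4 = 12.5
tolerance_sq = wa_tolerance_sq(temps, counts)  # (3*0.25 + 1*2.25) / 4 = 0.75
```

Sites where the taxon is absent have zero weight, so they affect neither estimate.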



In this paper, we present results that have been obtained using the partial least 
squares addition to weighted averaging: the method is abbreviated WA-PLS [1]. 
The top panel of Figure 3 displays the reconstruction obtained with WA-PLS. 

Several other methods have also been used in reconstructions. We mention 
only canonical correspondence analysis (CCA) and modern analogue techniques
(MAT), which actually are nearest neighbour techniques.

Remarks Space does not permit us to present a detailed critique of the existing 
methods, thus we only mention two points. 

Many of these methods have a weakness in extrapolation: the predictions tend 
to regress towards the mean. This is in many cases corrected by “deshrinking”,
which basically just spreads the predictions farther away from the mean than 
they originally were. 

Another problem is the rigid structure of the response curves: they are often 
assumed to be symmetric, and the accuracy is assumed to be the same at all 
locations along the temperature dimension. 

4 Machine Learning Methods 

In this section we present results with two machine learning approaches: K 
nearest neighbors and regression tree learning.

K nearest neighbors In instance or case-based learning the idea is not to learn
any function between the input and output attributes, but rather to compare new
cases directly to the ones already seen. The methods store the training examples,
and base their prediction for a new case on the most similar examples recorded
during the “training”. As mentioned above, in paleoecology nearest neighbor
methods have been used under the name “modern analogue techniques”.

We implemented a simple K nearest neighbor method. Given $y_0$, the method
finds $I$ such that $\{y_i \mid i \in I\}$ is the set of K modern environments most similar
to $y_0$. The method then returns the mean of the corresponding environmental
variables, i.e., the mean of $x_i$, $i \in I$. We used Euclidean distance as the
(dis)similarity measure and gave all the K neighbors the same weight.

The only parameter for this method is K, the number of nearest neighbors 
considered. The best cross validation standard error 1.42°C was obtained with
K = 6; values of K in [2, 8] yielded almost equally good standard errors, at
most 1.45°C in all cases. The second panel from the top in Figure 3 shows the 
reconstruction with 6 nearest neighbors. 
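
The method described above is small enough to sketch directly. The following Python is a minimal illustration, not the implementation used for the reported results: the function names and toy data are hypothetical, and leave-one-out is assumed as the cross validation scheme, which the text does not specify.

```python
import math


def knn_predict(y0, Y, x, k):
    """Mean of x over the k training assemblages closest to y0 (Euclidean)."""
    order = sorted(range(len(Y)), key=lambda i: math.dist(y0, Y[i]))
    nearest = order[:k]
    return sum(x[i] for i in nearest) / k


def loo_rmse(Y, x, k):
    """Leave-one-out root mean squared prediction error (assumed CV scheme)."""
    sq_errors = []
    for i in range(len(Y)):
        Y_rest = Y[:i] + Y[i + 1:]
        x_rest = x[:i] + x[i + 1:]
        sq_errors.append((knn_predict(Y[i], Y_rest, x_rest, k) - x[i]) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Toy data (hypothetical): 4 sites, 2 taxa, one temperature per site.
Y = [[5, 0], [4, 1], [1, 4], [0, 5]]
x = [10.0, 11.0, 14.0, 15.0]
pred = knn_predict([5, 1], Y, x, k=2)  # neighbours are sites 0 and 1
error = loo_rmse(Y, x, k=1)
```

Because the prediction is an average of training temperatures, it can never fall outside the range of the training set, which is the extrapolation weakness noted in the remarks of this section.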



Regression tree learning Regression tree learning is closely related to the better 
known decision tree learning. In both cases, the idea is to recursively subdivide 
the problem space into disjoint, more homogeneous areas. Whereas decision trees 
predict the classification of a new case, regression trees make a numerical pre- 
diction based, e.g., on a linear formula on the input parameters [2]. 

We ran experiments with the publicly available demonstration version of Cubist
(http://www.rulequest.com/). The best cross validation standard
error, 1.80°C, was obtained with option -i, directing Cubist to actually
combine the regression tree technique with a nearest neighbor component [8,5]. 
Without the nearest neighbor component, the best standard error of 1.90°C was 
obtained with the default settings of Cubist. This result is close to the standard 
deviation 1.92°C of the temperatures in the training set, and it is reasonable 
to say that in this case Cubist was not able to learn from the data. The third
panel from top in Figure 3 shows the reconstruction made by the regression tree 
learned by Cubist with option -i. 



Remarks The nearest neighbor method is attractive as it is simple and it has 
good cross validation accuracy. Its weakness is that, by definition, it cannot
extrapolate outside the space it has been trained with. Some estimation of the 
accuracy could be derived from the homogeneity of the K nearest neighbors, in 
addition to the cross validation standard error. The regression tree method we 
tested gives poor cross validation accuracy and is unreliable in extrapolation, so 
the reconstruction result is questionable. The reconstruction results do not match 
with the results obtained with the state of the art methods of paleoecological 
reconstruction such as WA-PLS.

5 Bayesian Approach

We give a brief introduction to the Bayesian approach to data analysis; see [3,4] 
for good expository treatments. We then give a statistical model for the recon- 
struction problem and present experimental results. 

Bayesian reconstruction One of the major advantages of the Bayesian approach 
is that problems can often be described in natural terms. In the case of paleoe- 
cology, for instance, a Bayesian model can specify what the plausible response 
functions U are, and also which functions are more likely than others. 

In the reconstruction problem, we apply the Bayesian approach as follows. For
simplicity, consider only one environmental variable $z$. Assume the response of a
taxon $k$ to variable $z$ is defined by the set $\{w_{k1}, w_{k2}, \ldots, w_{kh}\}$ of parameters, and
assume that the taxa are independent. The parameters of the model thus include
for each taxon $k$ the vector $w_k = (w_{k1}, w_{k2}, \ldots, w_{kh})$. We can now, in principle,
express the abundance of species $k$ at site $i$ as a function of the environmental
variables at site $i$ and the parameters of the species $k$: $y_{ik} = U(x_i, w_k)$. We write
$W = (w_1, w_2, \ldots, w_m)$.

For the general reconstruction problem, we are interested in obtaining the
posterior distribution $\Pr(x_0 \mid X, Y, y_0)$ of the unknown $x_0$ given the data $X$, $Y$,
and $y_0$. This can be written as

$$\Pr(x_0 \mid X, Y, y_0) = \int \Pr(x_0, W \mid X, Y, y_0)\, dW.$$

That is, the marginal probability of xq is obtained by integrating over all possi- 
ble parameter values W for the response functions. This equation indicates an 
important property of the Bayesian approach: we do not use just the “best” 
response function for each taxon; rather, all possible response functions are con- 
sidered (in principle) and their importance for the final results is determined by 
their likelihood and prior probability. 

According to Bayes’ theorem, we can write

$$\Pr(x_0, W \mid X, Y, y_0) = \frac{\Pr(x_0, W)\,\Pr(Y, y_0 \mid X, x_0, W)}{\Pr(Y, y_0 \mid X)} \propto \Pr(x_0, W)\,\Pr(Y, y_0 \mid X, x_0, W).$$

Model specification In the case of one environmental variable $z$ (such as temperature),
a typical ecological response function might be a smooth unimodal
function such as a bell-shaped curve. We use a simple model where the response
of each taxon $k$ is specified by such a curve

$$r(z, \alpha_k, \beta_k, \gamma_k), \qquad (1)$$

where $w_k = (\alpha_k, \beta_k, \gamma_k)$ are the model parameters for taxon $k$. Given a value of
the environmental variable $z$, the function specifies the suitability of the environment
for taxon $k$.

In the literature the abundances are typically represented as fractions; for
ease of modeling, we assume here that they are actual counts, i.e., $y_{ik}$ indicates
how many representatives of taxon $k$ were observed on site $i$. In the model
we assume that the abundances are Poisson distributed with parameter
$r(z_i, \alpha_k, \beta_k, \gamma_k)$, i.e.,

$$\Pr(y_{ik} \mid z_i, \alpha_k, \beta_k, \gamma_k) = (y_{ik}!)^{-1}\, r(z_i, \alpha_k, \beta_k, \gamma_k)^{y_{ik}} \exp(-r(z_i, \alpha_k, \beta_k, \gamma_k)). \qquad (2)$$

We also assume that given the parameters, the observed abundances of the
taxa are conditionally independent, although this is not exactly true. Thus the
likelihood $\Pr(Y, y_0 \mid z, \alpha, \beta, \gamma)$ can be written in product form:

$$\Pr(Y, y_0 \mid z, \alpha, \beta, \gamma) = \prod_{i=0}^{n} \Pr(y_i \mid z_i, \alpha, \beta, \gamma).$$

For an individual site $i$, assuming that the taxa are independent, we can write

$$\Pr(y_i \mid z_i, \alpha, \beta, \gamma) = \prod_{k=1}^{m} \Pr(y_{ik} \mid z_i, \alpha_k, \beta_k, \gamma_k).$$

Here $\Pr(y_{ik} \mid z_i, \alpha_k, \beta_k, \gamma_k)$ is the probability of observing abundance $y_{ik}$ at site $i$
for taxon $k$, whose response curve is described by the three parameters $\alpha_k$, $\beta_k$,
and $\gamma_k$.
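
The product form of the likelihood becomes a double sum once logarithms are taken. The sketch below illustrates this under one loud assumption: the function `bell_response` is a hypothetical Gaussian-shaped stand-in for the response curve of equation (1), whose exact parameterization is not reproduced here. All function names and toy values are likewise illustrative.

```python
import math


def bell_response(z, alpha, beta, gamma):
    """ASSUMED bell-shaped curve: height alpha, optimum beta, width gamma.

    Illustrative only; not necessarily the form of equation (1).
    """
    return alpha * math.exp(-((z - beta) ** 2) / gamma)


def poisson_logpmf(y, rate):
    """Log of (y!)^-1 * rate^y * exp(-rate), the Poisson probability of (2)."""
    rate = max(rate, 1e-300)  # guard against log(0) far from the optimum
    return y * math.log(rate) - rate - math.lgamma(y + 1)


def site_loglik(y_i, z_i, params):
    """Sum over taxa k of log Pr(y_ik | z_i, alpha_k, beta_k, gamma_k)."""
    return sum(
        poisson_logpmf(y_ik, bell_response(z_i, *w_k))
        for y_ik, w_k in zip(y_i, params)
    )


def total_loglik(Y, z, params):
    """Sum over sites i of the per-site log-likelihoods."""
    return sum(site_loglik(y_i, z_i, params) for y_i, z_i in zip(Y, z))


# Toy example (hypothetical): 2 taxa with optima at 12 and 15 degrees.
params = [(10.0, 12.0, 8.0), (5.0, 15.0, 8.0)]
Y = [[9, 1], [2, 6]]
z = [12.0, 15.0]
loglik = total_loglik(Y, z, params)
```

In this toy example an assemblage dominated by the first taxon is more probable at 12 degrees than at 15 degrees, which is the information the reconstruction exploits.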

To complete the model specification, we need to give the prior probabilities
$\Pr(x_0, W)$. For the priors of the parameters we write

$$\Pr(x_0, \alpha, \beta, \gamma) = \Pr(x_0) \prod_{k=1}^{m} \Pr(\alpha_k)\,\Pr(\beta_k)\,\Pr(\gamma_k),$$

i.e., the priors for the unknown temperature and the individual parameters for
the $m$ response functions are independent.

The prior for the temperature $x_0$ to be reconstructed is a normal distribution
with mean 12.1 and variance 0.37. The mean has been chosen to be the same
as the mean of the modern training set. The variance is only one tenth of the
variance of the modern training set, in order to make conservative rather than
radical reconstructions.
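
As a quick arithmetic check, the prior variance can be related to the training-set standard deviation of 1.92°C quoted in Section 4. The symbols $\sigma^2_{\mathrm{train}}$ and $\sigma^2_{\mathrm{prior}}$ are introduced here only for this check:

$$\sigma^2_{\mathrm{train}} \approx 1.92^2 \approx 3.69, \qquad \sigma^2_{\mathrm{prior}} = \tfrac{1}{10}\,\sigma^2_{\mathrm{train}} \approx 0.37, \qquad \sigma_{\mathrm{prior}} \approx 0.61.$$

The prior therefore has a standard deviation of roughly 0.6°C around the modern mean of 12.1°C.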

For the parameters $\alpha_k$, $\beta_k$, and $\gamma_k$ we use the following priors: $\alpha_k$ has a uniform
distribution in [0.1, 50], $\beta_k$ has a normal distribution with parameters 12.1
and 3.7, and $\gamma_k$ is assumed to be gamma distributed with parameters 50 and 5.
The prior of $\alpha_k$ allows any possible value with equal probability. The optimum
temperatures $\beta_k$ are assumed to have roughly the same distribution as the temperatures.
Finally, the prior of $\gamma_k$ is defined to encourage conservative, gently
sloping response curves.
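
The independent priors can be written as a single log-density, sketched below in Python. Two readings of the text are assumptions: the normal prior's “parameters 12.1 and 3.7” are taken as mean and variance, and the gamma prior's “parameters 50 and 5” as shape and rate. The text does not state either parameterization, and the function names are hypothetical.

```python
import math


def normal_logpdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def uniform_logpdf(x, low, high):
    return -math.log(high - low) if low <= x <= high else -math.inf


def gamma_logpdf(x, shape, rate):
    if x <= 0:
        return -math.inf
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1) * math.log(x) - rate * x)


def log_prior(x0, params):
    """log Pr(x0) plus, for each taxon, the logs of its three parameter priors."""
    total = normal_logpdf(x0, 12.1, 0.37)         # mean and variance from the text
    for alpha, beta, gamma in params:
        total += uniform_logpdf(alpha, 0.1, 50.0)
        total += normal_logpdf(beta, 12.1, 3.7)   # ASSUMPTION: mean, variance
        total += gamma_logpdf(gamma, 50.0, 5.0)   # ASSUMPTION: shape, rate
    return total


# One taxon with hypothetical parameter values.
lp_inside = log_prior(12.1, [(1.0, 12.1, 10.0)])
lp_outside = log_prior(12.1, [(60.0, 12.1, 10.0)])  # alpha outside [0.1, 50]
```

A parameter value outside the support of its prior gives a log-density of minus infinity, so any sampler built on this function rejects such values automatically.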

Computing the posterior distribution of the unknown variables in a closed
form is typically not possible. Stochastic simulation techniques called Markov
chain Monte Carlo (MCMC) methods can be used to computationally approximate
almost arbitrary distributions. For an overview of MCMC methods, see,
e.g., [4]. The following results have been obtained with Bassist [9], a general purpose
MCMC tool under development at the Department of Computer Science
at the University of Helsinki.
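
To give a feel for what an MCMC method does, the following Python sketch runs a random-walk Metropolis sampler for the unknown temperature $x_0$ alone. It is a strongly simplified illustration and not how Bassist works: the response parameters are held fixed instead of being sampled, the bell-shaped rate is an assumed Gaussian-shaped curve, and the toy data, step size, and burn-in length are hypothetical.

```python
import math
import random


def log_posterior(x0, y0, params):
    """Unnormalised log posterior of x0: normal prior plus Poisson likelihood."""
    log_p = -((x0 - 12.1) ** 2) / (2 * 0.37)   # prior mean and variance from the text
    for y, (alpha, beta, gamma) in zip(y0, params):
        rate = alpha * math.exp(-((x0 - beta) ** 2) / gamma)  # ASSUMED curve
        rate = max(rate, 1e-300)
        log_p += y * math.log(rate) - rate     # the y! term does not depend on x0
    return log_p


def metropolis(y0, params, n_iter=20000, step=0.5, seed=0):
    """Random-walk Metropolis: propose x + noise, accept with prob min(1, ratio)."""
    rng = random.Random(seed)
    x = 12.1
    log_p = log_posterior(x, y0, params)
    samples = []
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_posterior(proposal, y0, params)
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples


# Toy fossil assemblage (hypothetical): the warm-adapted taxon dominates.
params = [(10.0, 10.0, 8.0), (10.0, 14.0, 8.0)]
y0 = [2, 9]
samples = metropolis(y0, params)[2000:]        # discard burn-in
posterior_mean = sum(samples) / len(samples)
```

The retained samples approximate the posterior of $x_0$; their mean and quantiles play the role of the posterior mean and the 95 % band reported in the results. In the full model the parameters $W$ would be sampled as well, so that uncertainty about the response curves carries over to the reconstruction.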

Results Figure 2 shows the posterior mean of the response curve for the species 
of Figure 1. According to the model and the data (and the simulation), the 
response curve is within the broken lines with 95 % probability. This demon- 
strates a useful aspect of Bayesian reasoning: in the colder areas on the left, 
where there is not much evidence for fitting the response curve, there is also 
more variation in the posterior. On the right, where it is more obvious from 
Figure 1 that the temperatures are not suitable for this species, the response 
curve is almost certainly quite low. This should be contrasted with the weighted
averaging methods, where the response curve is always a symmetric unimodal 
curve. 




Fig. 2. A response model for the chironomid taxon of Figure 1.



Finally, the bottom panel of Figure 3 shows the reconstruction obtained with 
the above model. Again, the solid line is the posterior mean, and the area be- 
tween the broken lines contains 95 % of the variation. The reconstruction closely 
matches the one obtained by the statistical method used by paleoecologists. The 
cross validation standard error of the model is 1.49°C. 

Remarks A strength of full probability models is that uncertainty is represented
naturally. Here we obtain information about the credibility of the reconstruction:
the uncertainty is carried throughout the process from the vagueness of the data
to the different probabilities of response functions, and then to the probabilities
of different reconstructions.

Bayesian modeling is, in principle, simple: what one needs to describe is (a 
suitable abstraction of) the natural effects and causes. The model is essentially 
defined by equations 1 and 2 plus the prior distributions. The bias necessary for 
the task, that the constructed models are ecologically plausible, was easily built 
into the model, in this example, e.g., as the assumption that response curves are 
bell-shaped. Consequently the method can be assumed to function reasonably 
also for fossil data, which may require considerable extrapolation.

On the down-side is the fact that such models are slow to fit. Running the
above model for the above data until convergence currently takes several hours
on a PC. This is, however, cheap compared to the man-months or years spent on
obtaining the data in the first place. 

6 Discussion 

The role of machine learning, exploratory rule finding methods, and statistical
methods is one of the key issues in developing the data mining area. We have
presented in a case study from paleoecology how these different methods can be
applied to a relatively difficult scientific data analysis and modeling problem.

Experience with this case study shows that the specialized statistical methods
for reconstructions developed within the paleoecological community have
their strengths and weaknesses. They typically try to use only biologically vi- 
able assumptions about the taxa, but the treatment of nonlinearities and the 
approach to uncertainty in the data are relatively ad hoc. 

Machine learning methods are in principle directly applicable to the reconstruction
task. However, the background knowledge about how taxa respond to
changes in the environmental variable, temperature, is very difficult to incor- 
porate into the approaches. Therefore these methods produce reconstructions 
whose accuracy is difficult to estimate. 

Exploratory data mining methods might seem to be useless for the recon- 
struction process. This is partly true: once the taxa have been selected, other 
approaches are more useful. However, in the preliminary phase of data collec- 
tion, understanding the dependencies between different taxa can well be aided 
by looking exploratively for rules relating changes in the abundance of one taxon 
to changes in the abundances of other taxa. 

The Bayesian approach to the reconstruction problem gives a natural way of 
embedding the biological background knowledge into the model. The conceptual 
structure of the full probability model can actually be obtained directly from the 
structure of the data: the structure of the probability model is very similar to 
the ER-diagram of the data. It seems that this correspondence holds for a fairly 
large class of examples: stripping details of distributions from a Bayesian model
gives a description of the data items and their relationships, i.e., an ER-diagram.
This is one of the reasons why Bayesian hierarchical models seem so well suited
for data mining tasks.

In the Bayesian approach, estimating the uncertainty in the reconstruction is
also simple, following immediately from the basic principles of the approach. The
problem with the Bayesian approach is that it is relatively computer-intensive:
the MCMC simulations can take a fair amount of time.

In Table 2 we show the cross validation prediction accuracies of the different 
approaches. As can be seen, simple nearest neighbor methods achieve the best 
accuracy, closely followed by the Bayesian method. 



Table 2. Summary of cross validation prediction accuracies of different approaches.
Results for the methods WA through MAT are from [6]. For brevity, these
methods are not described in detail in the text; see [1].

Method                    Cross validation accuracy (°C)
------------------------  ------------------------------
WA (inverse)              1.56
WAtol (inverse)           1.76
WA (classical)            2.06
WAtol (classical)         2.49
PLS (1 component)         1.57
WA-PLS (1 component)      1.53
GLM                       2.16
MAT (6 neighbors)         1.48
6 nearest neighbors       1.42
regression tree + NN      1.80
regression tree           1.92
a Bayesian model          1.49



Prediction accuracy is not by itself the goal. Comparing the reconstructions 
in Figure 3, we note that the statistical state of the art method WA-PLS and the 
Bayesian method produce similar results: there is a slight increase in temperature 
over the last 5000 years, with sudden short episodes during which the temperature
falls. This fits rather well with the results obtained from other datasets.
The reconstruction produced by Cubist has no trend at all, and the variability 
in the nearest neighbor result is quite large. 

It is somewhat surprising that a reasonably simple Bayesian model can pro- 
duce results similar to or better than those obtained by using statistical ap- 
proaches tailored to the reconstruction problem. One explanation for this is that 
the biological background knowledge is naturally represented. The other is the 
somewhat nonparametric nature of the method, allowing in practice many dif- 
ferent forms of posterior response curves. 

Preliminary experiments indicate also that small perturbations in the data 
do not change the Bayesian results, whereas such changes can have a drastic 
effect on the results of other methods. 



Learning, Mining, or Modeling? A Case Study from Paleoecology 







Fig. 3. Temperature reconstructions from top to bottom: statistical state of 
the art method WA-PLS, 6 nearest neighbors, regression tree (Cubist), and a 
Bayesian model (Bassist). 

Abstract. In this paper, we review AI research on computational discovery
and its recent application to the discovery of new scientific knowledge.
We characterize five historical stages of the scientific discovery process,
which we use as an organizational framework in describing applications.
We also identify five distinct steps during which developers or users
can influence the behavior of a computational discovery system. Rather
than criticizing such intervention, as done in the past, we recommend it
as the preferred approach to using discovery software. As evidence for
the advantages of such human-computer cooperation, we report seven
examples of novel, computer-aided discoveries that have appeared in the
scientific literature, along with the role that humans played in each case.
We close by recommending that future systems provide more explicit
support for human intervention in the discovery process.



1 Introduction 

The process of scientific discovery has long been viewed as the pinnacle of creative 
thought. Thus, to many people, including some scientists themselves, it seems 
an unlikely candidate for automation by computer. However, over the past two 
decades, researchers in artificial intelligence have repeatedly questioned this at- 
titude and attempted to develop intelligent artifacts that replicate the act of 
discovery. The computational study of scientific discovery has made important 
strides in its short history, some of which we review in this paper. 

Artificial intelligence often gets its initial ideas from observing human behav- 
ior and attempting to model these activities. Computational scientific discovery 
is no exception, as early research focused on replicating discoveries from the 
history of disciplines as diverse as mathematics (Lenat, 1977), physics (Lang- 
ley, 1981), chemistry (Zytkow & Simon, 1986), and biology (Kulkarni & 
Simon, 1990). As the collection by Shrager and Langley (1990) reveals, these 
efforts also had considerable breadth in the range of scientific activities they 
attempted to model, though most work aimed to replicate the historical record 
only at the most abstract level. Despite the explicit goals of this early research, 
some critics (e.g., Gillies, 1996) have questioned progress in the area because it
dealt with scientific laws and theories already known to the developers. 



S. Arikawa and H. Motoda (Eds.): DS’98, LNAI 1532, pp. 25-39, 1998. 
© Springer-Verlag Berlin Heidelberg 1998

Although many researchers have continued their attempts to reproduce historical
discoveries, others have turned their energies toward the computational
discovery of new scientific knowledge. As with the historical research, this applied 
work covers a broad range of disciplines, including mathematics, astronomy, met- 
allurgy, physical chemistry, biochemistry, medicine, and ecology. Many of these 
efforts have led to refereed publications in the relevant scientific literature, which 
seems a convincing measure of their accomplishment. 

Our aim here is to examine some recent applications of computational scien- 
tific discovery and to analyze the reasons for their success. We set the background 
by reviewing the major forms that discovery takes in scientific domains, giving 
a framework to organize the later discussion. After this, we consider steps in the 
larger discovery process at which humans can influence the behavior of a compu- 
tational discovery system. We then turn to seven examples of computer-aided 
discoveries that have produced scientific publications, in each case considering 
the role played by the developer or user. In closing, we consider directions for 
future work, emphasizing the need for discovery aids that explicitly encourage 
interaction with humans. 



2 Stages of the Discovery Process 

The history of science reveals a variety of distinct types of discovery activity, 
ranging from the detection of empirical regularities to the formation of deeper 
theoretical accounts. Generally speaking, these activities tend to occur in a given 
order within a field, in that the products of one process influence or constrain 
the behavior of successors. Of course, science is not a strictly linear process, so 
that earlier stages may be revisited in the light of results from a later stage, but 
the logical relation provides a convenient framework for discussion. 

Perhaps the earliest discovery activity involves the formation of taxonomies.
Before one can formulate laws or theories, one must first establish the basic 
concepts or categories one hopes to relate. An example comes from the early 
history of chemistry, when scientists agreed to classify some chemicals as acids, 
some as alkalis, and still others as salts based on observable properties like taste. 
Similar groupings have emerged in other fields like astronomy and physics, but 
the best known taxonomies come from biology, which groups living entities into 
categories and subcategories in a hierarchical manner. 

Once they have identified a set of entities, scientists can begin to discover 
qualitative laws that characterize their behavior or that relate them to each 
other. For example, early chemists found that acids tended to react with alkalis 
to form salts, along with similar connections among other classes of chemicals. 
Some qualitative laws describe static relations, whereas others summarize events 
like reactions that happen over time. Again, this process can occur only after a 
field has settled on the basic classes of entities under consideration. 

A third scientific activity aims to discover quantitative laws that state mathematical
relations among numeric variables. For instance, early chemists identified
the relative masses of hydrochloric acid and sodium hydroxide that combine
to form a unit mass of sodium chloride. This process can also involve postulating
the existence of an intrinsic property like density or specific heat, as well as esti- 
mating the property’s value for specific entities. Such numeric laws are typically 
stated in the context of some qualitative relationship that places constraints on 
their operation. 

Scientists in most fields are not content with empirical summaries and so try
to explain such regularities, with the most typical first step involving the creation
of structural models that incorporate unobserved entities. Thus, nineteenth century
chemists like Dalton and Avogadro postulated atomic and molecular models
of chemicals to account for the numeric proportions observed in reactions. Initial 
models of this sort are typically qualitative in nature, stating only the compo- 
nents and their generic relations, but later models often incorporate numeric 
descriptions that provide further constraints. Both types of models are closely 
tied to the empirical phenomena they are designed to explain. 

Eventually, most scientific disciplines move beyond structural models to pro- 
cess models, which explain phenomena in terms of hypothesized mechanisms 
that involve change over time. One well-known process account is the kinetic 
theory of gases, which explains the empirical relations among gas volume, pres- 
sure, and temperature in terms of interactions among molecules. Again, some 
process models (like those in geology) are mainly qualitative, while others (like 
the kinetic theory) include numeric components, but both types make contact 
with empirical laws that one can derive from them. 

In the past two decades, research in automated scientific discovery has
addressed each of these five stages. Clustering systems like Cluster/2
(Michalski & Stepp, 1983), AutoClass (Cheeseman et al., 1988), and others deal with
the task of taxonomy formation, whereas systems like NGlauber (Jones, 1986)
search for qualitative relations. Starting with Bacon (Langley, 1981; Langley,
Simon, Bradshaw, & Zytkow, 1987), researchers have developed a great variety
of systems that discover numeric laws. Systems like Dalton (Langley et
al., 1987), Stahlp (Rose & Langley, 1987), and Gell-Mann (Zytkow, 1996)
formulate structural models, whereas a smaller group, like Mechem
(Valdes-Perez, 1995) and Astra (Kocabas & Langley, 1998), instead construct process
models.

A few systems, such as Lenat’s (1977) AM, Nordhausen and Langley’s IDS 
(1993), and Kulkarni and Simon’s (1990) Kekada, deal with more than one of 
these facets, but most contributions have focused on one stage to the exclusion of 
others. Although the work to date has emphasized rediscovering laws and mod- 
els from the history of science, we will see that a similar bias holds for efforts 
at finding new scientific knowledge. We suspect that integrated discovery appli- 
cations will be developed, but only once the more focused efforts that already 
exist have become more widely known. 

This framework is not the only way to categorize scientific activity, but it 
appears to have general applicability across different fields, so we will use it 
to organize our presentation of applied discovery work. The scheme does favor 
methods that generate the types of formalisms reported in the scientific
literature, and thus downplays the role of mainstream techniques from machine
learning. For example, decision-tree induction, neural networks, and nearest neighbor
have produced quite accurate predictors in scientific domains like molecular
biology (Hunter, 1993), but they employ quite different notations from those used
normally to characterize scientific laws and models. For this reason, we will not
focus on their application to scientific problems here.



3 The Developer’s Role in Computational Discovery 

Although the term computational discovery suggests an automated process, close 
inspection of the literature reveals that the human developer or user plays an 
important role in any successful project. Early computational research on sci- 
entific discovery downplayed this fact and emphasized the automation aspect, 
in general keeping with the goals of artificial intelligence at the time. However, 
the new climate in AI favors systems that advise humans rather than replace 
them, and recent analyses of machine learning applications (e.g., Langley & Si- 
mon, 1995) suggest an important role for the developer. Such analyses carry over 
directly to discovery in scientific domains, and here we review the major ways 
in which developers can influence the behavior of discovery systems. 

The first step in using computational discovery methods is to formulate the 
discovery problem in terms that can be solved using existing techniques. The 
developer must first cast the task as one that involves forming taxonomies, find- 
ing qualitative laws, detecting numeric relations, forming structural models, or 
constructing process accounts. For most methods, he must also specify the depen- 
dent variables that laws should predict or indicate the phenomena that models 
should explain. Informed and careful problem formulation can greatly increase 
the chances of a successful discovery application. 

The second step in applying discovery techniques is to settle on an effective 
representation.¹ The developer must state the variables or predicates used to
describe the data or phenomena to be explained, along with the output repre- 
sentation used for taxonomies, laws, or models. The latter must include the oper- 
ations allowed when combining variables into laws and the component structures 
or processes used in explanatory models. The developer may also need to encode 
background knowledge about the domain in terms of an initial theory or results 
from earlier stages of the discovery process. Such representational engineering 
plays an essential role in successful applications of computational discovery. 

Another important developer activity concerns preparing the data or phe- 
nomena on which the discovery system will operate. Data collected by scientists 
may be quite sparse, lack certain values, be very noisy, or include outliers, and 
the system user can improve the quality of these data manually or using tech- 
niques for interpolation, inference, or smoothing. Similarly, scientists’ statements 
of empirical phenomena may omit hidden assumptions that the user can make 



¹ We are not referring here to the representational formalism, such as decision trees
or neural networks, but rather to the domain features encoded in a formalism. 

explicit or include irrelevant statements that he can remove. Such data manipulation
can also improve the results obtained through computational discovery.

Research papers on machine discovery typically give the algorithm center 
stage, but they pay little attention to the developer’s efforts to modulate the 
algorithm’s behavior for given inputs. This can involve activities like the man- 
ual setting of system parameters (e.g., for evidence thresholds, noise tolerance, 
and halting criteria) and the interactive control of heuristic search by rejecting 
bad candidates or attending to good ones. Some systems are designed with this 
interaction in mind, whereas others support the process more surreptitiously. 
But in either case, such algorithm manipulation is another important way that 
developers and users can improve their chances for successful discoveries. 

A final step in the application process involves transforming the discovery
system’s output into results that are meaningful to the scientific community. This
stage can include manual filtering of interesting results from the overall output,
recasting these results in comprehensible terms or notations, and interpreting 
the relevance of these results for the scientific field. Thus, such postprocessing 
subsumes both the human user’s evaluation of scientific results and their com- 
munication to scientists who will find them interesting. Since evaluation and 
communication are central activities in science, they play a crucial role in com- 
putational discovery as well. 

The literature on computational scientific discovery reveals, though often
between the lines, that developers’ intervention plays an important role even in 
historical models of discovery. Indeed, early critiques of machine discovery re- 
search frowned on these activities, since both developers and critics assumed the 
aim was to completely automate the discovery process. However, this view has 
changed in recent years, and the more common perspective, at least in applied 
circles, is that discovery systems should aid scientists rather than replace them. 
In this light, human intervention is perfectly acceptable, especially if the goal is 
to discover new scientific knowledge and not to assign credit. 

4 Some Computer-Aided Scientific Discoveries 

Now that we have set the stage, we are ready to report some successful applica- 
tions of AI methods to the discovery of new scientific knowledge. We organize the 
presentation in terms of the basic scientific activities described earlier, starting 
with examples of taxonomy formation, then moving on to law discovery and fi- 
nally to model construction. In each case, we review the basic scientific problem, 
describe the discovery system, and present the novel discovery that it has pro- 
duced. We also examine the role that the developer played in each application, 
drawing on the five steps outlined in the previous section. 

Although we have not attempted to be exhaustive, we did select examples 
that meet certain criteria. Valdes-Perez (1998) suggests that scientific discovery 
involves the “generation of novel, interesting, plausible, and intelligible
knowledge about objects of scientific study”, and reviews four computer-aided
discoveries that he argues meet this definition. Rather than repeating his analysis,
we have chosen instead to use publication of the result in the relevant scientific
literature as our main criterion for success, though we suspect that refereed
publication is highly correlated with his factors.

4.1 Stellar Taxonomies from Infrared Spectra 

Existing taxonomies of stars are based primarily on characteristics from the 
visible spectrum. However, artificial satellites provide an opportunity to make 
measurements of types that are not possible from the Earth’s surface, and the 
resulting data could suggest new groupings of known stellar objects. One such 
source of new data is the Infrared Astronomical Satellite, which has produced 
a database describing the intensity of some 5425 stars at 94 wavelengths in the 
infrared spectrum. 

Cheeseman et al. (1988) applied their AutoClass system to these infrared 
data. They designed this program to form one-level taxonomies, that is, to group 
objects into meaningful classes or clusters based on similar attribute values. For 
this domain, they chose to represent each cluster in terms of a mean and variance 
for each attribute, thus specifying a Gaussian distribution. The system carries 
out a gradient descent search through the space of such descriptions, starting 
with random initial descriptions for a specified number of clusters. On each step, 
the search process uses the current descriptions to probabilistically assign each 
training object to each class, and then uses the observed values for each object to 
update class descriptions, repeating this process until only minor changes occur. 
At a higher level, AutoClass iterates through different numbers of clusters to 
determine the best taxonomy, starting with a user-specified number of classes 
and increasing this count until it produces classes with negligible probabilities. 
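
The assign-then-update loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: it handles a single numeric attribute and a fixed number of classes, and it uses a standard expectation-maximization update for a Gaussian mixture rather than AutoClass's own Bayesian search. The function names and the toy data in the usage note are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gaussian_classes(data, k, iterations=100, tol=1e-6, seed=0):
    """Soft-assign objects to k classes, then re-estimate each class description."""
    rng = random.Random(seed)
    means = rng.sample(data, k)          # random initial class descriptions
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # Step 1: probabilistically assign each object to each class.
        memberships = []
        for x in data:
            scores = [w * gaussian_pdf(x, m, v)
                      for w, m, v in zip(weights, means, variances)]
            total = sum(scores) or 1e-300
            memberships.append([s / total for s in scores])
        # Step 2: update class descriptions from the weighted observations.
        new_means = []
        for j in range(k):
            mass = sum(row[j] for row in memberships) or 1e-300
            mean = sum(row[j] * x for row, x in zip(memberships, data)) / mass
            var = sum(row[j] * (x - mean) ** 2
                      for row, x in zip(memberships, data)) / mass
            new_means.append(mean)
            variances[j] = max(var, 1e-6)    # guard against collapsed classes
            weights[j] = mass / len(data)
        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means = new_means
        if shift < tol:                      # stop when only minor changes occur
            break
    return sorted(zip(means, variances, weights))
```

Calling `fit_gaussian_classes(values, 2)` on a list of distinct numbers returns one `(mean, variance, weight)` triple per class. An outer loop over `k` that discards classes with negligible weight would mimic the higher-level iteration over cluster counts.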

Application of AutoClass to the infrared data on stars produced 77 stellar 
classes, which the developers organized into nine higher-level clusters by run- 
ning the system on the cluster descriptions themselves. The resulting taxonomy 
differed significantly from the one then used in astronomy, and the collaborat- 
ing astronomers felt that it reflected some important results. These included a 
new class of blackbody stars with significant infrared excess, presumably due 
to surrounding dust, and a very weak spectral ‘bump’ at 13 microns in some 
classes that was undetectable in individual spectra. Goebel et al. (1989) recount 
these and other discoveries, along with their physical interpretation; thus, the 
results were deemed important enough to justify their publication in a refereed
astrophysical journal. 

Although AutoClass clearly contributed greatly to these discoveries, the 
developers acknowledge that they also played an important role (Cheeseman & 
Stutz, 1996). Casting the basic problem in terms of clustering was straightfor- 
ward, but the team quickly encountered problems with the basic infrared spectra, 
which had been normalized to ensure that all had the same peak height. To ob- 
tain reasonable results, they renormalized the data so that all curves had the 
same area. They also had to correct for some negative spectral intensities, which 
earlier software used by the astronomers had caused by subtracting out a back- 
ground value. The developers’ decision to run AutoClass on its own output 
to produce a two-level taxonomy constituted another intervention. Finally, the
collaborating astronomers did considerable interpretation of the system outputs 
before presenting them to the scientific community. 
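
The renormalization step mentioned above, rescaling each spectrum so that all curves share the same area rather than the same peak height, can be illustrated as follows. This is a hedged sketch: the trapezoidal rule and the clipping of negative intensities to zero are assumptions made for the example, since the text does not say how the developers carried out either correction, and the function name is hypothetical.

```python
def renormalize_to_unit_area(wavelengths, intensities):
    """Rescale a spectrum so the trapezoidal area under the curve equals one."""
    # Assumption: negative intensities are treated as zero before rescaling.
    clipped = [max(0.0, v) for v in intensities]
    area = sum((w2 - w1) * (a + b) / 2.0
               for w1, w2, a, b in zip(wavelengths, wavelengths[1:],
                                       clipped, clipped[1:]))
    if area == 0:
        raise ValueError("spectrum has zero area")
    return [v / area for v in clipped]
```

After this transformation, two spectra with the same shape but different overall brightness become identical, which is the property a clustering method needs if it is to group objects by spectral shape.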



4.2 Qualitative Factors in Carcinogenesis 

Over 80,000 chemicals are available commercially, yet the long-term health effects 
are known for only about 15 percent of them. Even fewer definitive results are 
available about whether chemicals cause cancer, since the standard tests for 
carcinogens involve two-year animal bioassays that cost $2 million per chemical. 
As a result, there is great demand for predictive laws that would let one predict 
carcinogenicity from more rapid and less expensive measurements. 

Lee, Buchanan, and Aronis (1998) have applied the rule-induction system RL 
to the problem of discovering such qualitative laws. The program constructs a set 
of conjunctive rules, each of which states the conditions under which some result 
occurs. Like many other rule-induction methods, RL invokes a general-to-specific 
search to generate each rule, selecting conditions to add that increase the rule’s 
ability to discriminate among classes and halting when there is no improvement 
in accuracy. The system also lets the user bias this search by specifying desirable 
properties of the learned rules. 
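
A general-to-specific search of the kind described above can be sketched as a greedy loop that starts from the empty rule and adds one condition at a time. This is a minimal illustration, not the RL program: it scores rules by the fraction of covered examples in the target class, an assumed stand-in for RL's own evaluation, and it omits the user-supplied biases. The function name `learn_rule` is hypothetical.

```python
def learn_rule(examples, target):
    """Greedy general-to-specific search for one conjunctive rule.

    examples: list of (attributes, label) pairs, where attributes is a dict.
    Returns the rule (attribute -> required value) and its precision.
    """
    def covered(rule):
        return [(a, y) for a, y in examples
                if all(a.get(k) == v for k, v in rule.items())]

    def precision(rule):
        hits = covered(rule)
        if not hits:
            return 0.0
        return sum(1 for _, y in hits if y == target) / len(hits)

    rule = {}                       # the most general rule covers everything
    best = precision(rule)
    while True:
        # Candidate conditions: attribute tests not yet used in the rule.
        candidates = {(k, v) for a, _ in examples for k, v in a.items()
                      if k not in rule}
        scored = [(precision({**rule, k: v}), k, v)
                  for k, v in sorted(candidates)]
        if not scored:
            break
        score, k, v = max(scored)
        if score <= best:           # halt when accuracy stops improving
            break
        rule[k] = v
        best = score
    return rule, best
```

Restricting `candidates` to a user-approved subset of attributes or values would be one simple way to model the kind of bias that the system lets its users specify.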

The developers ran RL on three databases for which carcinogenicity results 
were available, including 301, 108, and 1300 chemical compounds, respectively. 
Chemicals were described in terms of physical properties, structural features, 
short-term effects, and values on potency measures produced by another system. 
Experiments revealed that the induced rules were substantially more accurate 
than existing prediction schemes, which justified publication in the scientific
literature (Lee et al., 1996). They also tested the rules’ ability to classify 24 new
chemicals for which the status was unknown at development time; these results
were also positive and led to another scientific publication (Lee et al., 1995).

The authors recount a number of ways in which they intervened in the discov- 
ery process to obtain these results. For example, they reduced the 496 attributes 
for one database to only 75 features by grouping values about lesions on vari- 
ous organs. The developers also constrained the induction process by specifying 
that RL should favor some attributes over others when constructing rules and 
telling it to consider only certain values of a symbolic attribute for a given class, 
as well as certain types of tests on numeric attributes. These constraints, which 
they developed through interaction with domain scientists, took precedence over 
accuracy-oriented measures in deciding what conditions to select, and it seems 
likely that they helped account for the effort’s success. 



4.3 Quantitative Laws of Metallic Behavior 

A central process in the manufacture of iron and steel involves the removal of 
impurities from molten slag. Qualitatively, the chemical reactions that are
responsible for this removal process increase in effectiveness when the slag contains







Pat Langley 



more free oxide (O²⁻) ions. However, metallurgists have only imperfect quantitative
laws that relate the oxide amount, known as the basicity of the slag,
to dependent variables of interest, such as the slag's sulfur capacity. Moreover, 
basicity cannot always be measured accurately, so there remains a need for im- 
proved ways to estimate this intrinsic property. 

Mitchell, Sleeman, Duffy, Ingram, and Young (1997) applied computational 
discovery techniques to these scientific problems. Their Daviccand system in- 
cludes operations for selecting pairs of numeric variables to relate, specifying 
qualitative conditions that focus attention on some of the data, and finding nu- 
meric laws that relate variables within a given region. The program also includes 
mechanisms for identifying outliers that violate these numeric laws and for using 
the laws to infer the values of intrinsic properties when one cannot measure them 
more directly. 

The developers report two new discoveries in which Daviccand played a 
central role. The first involves the quantitative relation between basicity and 
sulfur capacity. Previous accounts modeled this relation using a single polynomial 
that held across all temperature ranges. The new results involve three simpler, 
linear laws that relate these two variables under different temperature ranges. 
The second contribution concerns improved estimates for the basicity of slags 
that contain TiO2 and FeO, which Daviccand inferred using the numeric laws
it induced from data, and the conclusion that FeO has quite different basicity
values for sulfur and phosphorus slags. These results were deemed important
enough to appear in a respected metallurgical journal (Mitchell et al., 1997). 
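
The idea of replacing one global polynomial with simpler linear laws that hold in separate temperature ranges can be illustrated with ordinary least squares. The sketch below is an assumption-laden example, not the Daviccand implementation: the region boundaries are supplied by the caller, standing in for the user's qualitative conditions, and the function names are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_by_region(records, boundaries):
    """Fit one linear law per temperature range.

    records: (temperature, x, y) triples; boundaries: ascending cut points.
    Returns a dict mapping region index to (slope, intercept).
    """
    regions = {}
    for temp, x, y in records:
        index = sum(temp >= b for b in boundaries)   # which range temp falls in
        regions.setdefault(index, []).append((x, y))
    return {i: fit_line(pts) for i, pts in sorted(regions.items())}
```

Points with large residuals under the fitted law for their region would be natural candidates for the outlier review that the text describes.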

Unlike most discovery systems, Daviccand encourages users to take part in 
the search process and provides explicit control points where they can influence 
choices. Thus, they formulate the problem by specifying what dependent variable 
the laws should predict and what region of the space to consider. Users also affect 
representational choices by selecting what independent variables to use when 
looking for numeric laws, and they can manipulate the data by selecting what 
points to treat as outliers. Daviccand presents its results in terms of graphical 
displays and functional forms that are familiar to metallurgists, and, given the 
user’s role in the discovery process, there remains little need for postprocessing 
to filter results. 

4.4 Quantitative Conjectures in Graph Theory 

A recurring theme in graph theory involves proving theorems about relations 
among quantitative properties of graphs. However, before a mathematician can 
prove that such a relation always holds, someone must first formulate it as a
conjecture. Although mathematical publications tend to emphasize proofs of 
theorems, the process of finding interesting conjectures is equally important and 
has much in common with discovery in the natural sciences. 

Fajtlowicz (1988) and colleagues have developed Graffiti, a system that 
generates conjectures in graph theory and other areas of discrete mathematics. 
The system carries out search through a space of quantitative relations like
$\sum x_i \geq \sum y_i$, where each $x_i$ and $y_i$ is some numerical feature of a graph (e.g.,
its diameter or its largest eigenvalue), the product of such elementary features,
or their ratio. Graffiti ensures that its conjectures are novel by maintaining
a record of previous hypotheses, and filters many uninteresting conjectures by 
noting that they seem to be implied by earlier, more general, candidates. 

Graffiti has generated hundreds of novel conjectures in graph theory, many 
of which have spurred mathematicians in the area to attempt their proof or refu- 
tation. In one case, the conjecture that the ‘average distance’ of a graph is no 
greater than its ‘independence number’ resulted in a proof that appeared in 
the refereed mathematical literature (Chung, 1988). Although Graffiti was 
designed as an automated discovery system, its developers have clearly con- 
strained its behavior by specifying the primitive graph features and the types 
of relations it should consider. Data manipulation occurs through a file that 
contains qualitatively different graphs, against which the system tests its con- 
jectures empirically, and postprocessing occurs when mathematicians filter the 
system output for interesting results. 
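
Empirical testing of a conjecture against a file of stored graphs can be sketched as follows, using the average-distance conjecture mentioned above. This is an illustrative sketch rather than Graffiti's code: graphs are assumed to be connected and given as adjacency dictionaries, the independence number is found by brute force (practical only for small graphs), and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def average_distance(adj):
    """Mean shortest-path length over all pairs of distinct vertices."""
    nodes = list(adj)
    total = 0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    n = len(nodes)
    return total / (n * (n - 1))


def independence_number(adj):
    """Size of the largest set of pairwise non-adjacent vertices."""
    nodes = list(adj)
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(v not in adj[u] for u, v in combinations(subset, 2)):
                return size
    return 0


def conjecture_holds(graphs):
    """Check 'average distance <= independence number' on every stored graph."""
    return all(average_distance(g) <= independence_number(g) for g in graphs)
```

A single counterexample in the stored collection is enough to reject a candidate relation, which is why a small set of qualitatively different graphs can filter many conjectures cheaply.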

4.5 Temporal Laws of Ecological Behavior 

One major concern in ecology is the effect of pollution on the plant and animal 
populations. Ecologists regularly develop quantitative models that are stated 
as sets of differential equations. Each such equation describes changes in one 
variable (its derivative) as a function of other variables, typically ones that can 
be directly observed. For example, Lake Glumsoe is a shallow lake in Denmark
with high concentrations of nitrogen and phosphorus from waste water, and 
ecologists would like to model the effect of these variables on the concentration 
of phytoplankton and zooplankton in the lake. 

Todorovski, Dzeroski, and Kompare (in press) applied techniques for numeric 
discovery to this problem. Their Lagramge system carries out search through a 
space of differential equations, looking for the equation set that gives the small- 
est error on the observed data. The system uses two constraints to make this 
search process tractable. First, Lagramge incorporates background knowledge 
about the domain in the form of a context-free grammar that it uses to gen- 
erate plausible equations. Second, it places a limit on the allowed depth of the 
derivations used to produce equations. For each candidate set of equations, the 
system uses numerical integration to estimate the error and thus the quality of 
the proposed model. 
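
The combination of grammar-constrained generation, a depth limit, and error estimation by numerical integration can be sketched for a single equation. This is a simplified illustration, not Lagramge itself: the small function `terms` stands in for a context-free grammar of the form T -> V | T * V, coefficients are drawn from a fixed grid as an assumed simplification, Euler's method is used for integration, and the variable names `x` (the modeled quantity) and `u` (an observed input) are hypothetical.

```python
from itertools import product


def terms(variables, depth):
    """Expand T -> V | T * V up to `depth` derivation levels."""
    level = [(v,) for v in variables]
    result = list(level)
    for _ in range(depth - 1):
        # Keep factors in sorted order so that u*x and x*u are generated once.
        level = [t + (v,) for t in level for v in variables if v >= t[-1]]
        result.extend(level)
    return result


def simulate(coef, term, x0, inputs, dt):
    """Euler-integrate dx/dt = coef * (product of the variables in `term`)."""
    xs = [x0]
    for u in inputs[:-1]:
        env = {"x": xs[-1], "u": u}
        rate = coef
        for name in term:
            rate *= env[name]
        xs.append(xs[-1] + dt * rate)
    return xs


def search(observed, inputs, dt, depth, coefs):
    """Return (error, coef, term) for the candidate with the smallest squared error."""
    best = None
    for term, coef in product(terms(["u", "x"], depth), coefs):
        predicted = simulate(coef, term, observed[0], inputs, dt)
        error = sum((p - o) ** 2 for p, o in zip(predicted, observed))
        if best is None or error < best[0]:
            best = (error, coef, term)
    return best
```

Raising `depth` enlarges the space of candidate equations quickly, which is why a bound on derivation depth keeps the search tractable.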

The developers report a new set of equations, discovered by Lagramge, 
that model accurately the relation between the pollution and plankton concen- 
trations in Lake Glumsoe. This revealed that phosphorus and temperature are 
the limiting factors on the growth of phytoplankton in the lake. We can infer 
Todorovski et al.’s role in the discovery process from their paper. They formu- 
lated the problem in terms of the variables to be predicted, and they engineered 
the representation both by specifying the predictive variables and by provid- 
ing the grammar used to generate candidate equations. Because the data were 
sparse (from only 14 time points over two months), they convinced three experts
to draw curves that filled in the gaps, used splines to smooth these curves,
and sampled from these ten times per day. They also manipulated Lagramge
by telling it to consider derivations that were no more than four levels deep. 
However, little postprocessing or interpretation was needed, since the system 
produces output in a form familiar to ecologists. 



4.6 Chemical Structures of Mutagens 

Another area of biochemistry with important social implications aims to un- 
derstand the factors that determine whether a chemical will cause mutations 
in genetic material. One data set that contains results of this sort involves 230 
aromatic and heteroaromatic nitro compounds, which can be divided into 138 
chemicals that have high mutagenicity and 92 chemicals that are low on this 
dimension. Structural models that characterize these two classes could prove 
useful in predicting whether new compounds pose a danger of causing mutation. 

King, Muggleton, Srinivasan, and Sternberg (1996) report an application of 
their Progol system to this problem. The program operates along lines similar 
to other rule-induction methods, in that it carries out a general-to-specific search 
for a conjunctive rule that covers some of the data, then repeats this process to 
find additional rules that cover the rest. The system also lets the user specify 
background knowledge, stated in the same form, which it takes into account in 
measuring the quality of induced rules. Unlike most rule-induction techniques, 
Progol assumes a predicate logic formalism that can represent relations among 
objects, rather than just attribute values. 

This support for relational descriptions led to revealing structural descrip- 
tions of mutation factors. For example, for the data set mentioned above, the 
system found one rule predicting that a compound is mutagenic if it has “a 
highly aliphatic carbon atom attached by a single bond to a carbon atom that is 
in a six-membered aromatic ring”. Combined with four similar rules, this charac- 
terization gave 81% correct predictions, which is comparable to the accuracy of 
other computational methods. However, alternative techniques do not produce 
a structural model that one can use to visualize spatial relations and thus to 
posit the deeper causes of mutation,^ so that the results justified publication in
the chemistry literature (King et al., 1996). 

As in other applications, the developers aided the discovery process in a 
number of ways. They chose to formulate the task in terms of finding a classifier
that labels chemicals as causing mutation or not, rather than predicting levels of 
mutagenicity. King et al. also presented their system with background knowledge 
about methyl and nitro groups, the length and connectivity of rings, and other 
concepts. In addition, they manipulated the data by dividing them into two groups
with different characteristics, as done earlier by others working in the area. 

^ This task does not actually involve structural modeling in the sense discussed in 
Section 2, since the structures are generalizations from observed data rather than 
combinations of unobserved entities posited to explain phenomena. However, appli- 
cations of such structural modeling do not appear in the literature, and the King 
et al. work seems the closest approximation. 
Although the induced rules were understandable in that they made clear contact
with chemical concepts, the authors aided their interpretation by presenting 
graphical depictions of their structural claims. Similar interventions have been 
used by the developers on related scientific problems, including prediction of 
carcinogenicity (King & Srinivasan, 1996) and pharmacophore discovery (Finn, 
Muggleton, Page, & Srinivasan, 1998). 



4.7 Reaction Pathways in Catalytic Chemistry 



For a century, chemists have known that many reactions involve, not a single 
step, but rather a sequence of primitive interactions. Thus, a recurring problem 
has been to formulate the sequence of steps, known as the reaction pathway, for 
a given chemical reaction. In addition to the reactants and products of the reac- 
tion, this inference may also be constrained by information about intermediate 
products, concentrations over time, relative quantities, and many other factors. 
Even so, the great number of possible pathways makes it possible that scientists 
will overlook some viable alternatives, so there exists a need for computational 
assistance on this task. 

Valdes-Perez (1995) developed Mechem with this end in mind. The sys- 
tem accepts as input the reactants and products for a chemical reaction, along 
with other experimental evidence and considerable background knowledge about 
the domain of catalytic chemistry. Mechem lets the user specify interactively 
which of these constraints to incorporate when generating pathways, giving him 
control over its global behavior. The system carries out a search through the 
space of reaction pathways, generating the elementary steps from scratch using 
special graph algorithms. Search always proceeds from simpler pathways (fewer 
substances and steps) to more complex ones. Mechem uses its constraints to 
eliminate pathways that are not viable and also to identify any intermediate 
products it hypothesizes in the process. The final output is a comprehensive set 
of simplest pathways that explain the evidence and that are consistent with the 
background knowledge. 

This approach has produced a number of novel reaction pathways that have 
appeared in the chemical literature. For example, Valdes-Perez (1994) reports 
a new explanation for the catalytic reaction ethane + H2 → 2 methane, which
chemists had viewed as largely solved, whereas Zeigarnik et al. (1997) present
another novel result on acrylic acid. Bruk et al. (1998) describe a third application
of Mechem that produced 41 novel pathways, which prompted experimental 
studies that reduced this to a small set consistent with the new data. The hu- 
man’s role in this process is explicit, with users formulating the problem through 
stating the reaction of interest and manipulating the algorithm’s behavior by in- 
voking domain constraints. Because Mechem produces pathways in a notation 
familiar to chemists, its outputs require little interpretation. 
4.8 Other Computational Aids for Scientific Research

We have focused on the examples above because they cover a broad range of
scientific problems and illustrate the importance of human interaction with the 
discovery system, but they do not exhaust the list of successful applications. For 
example, Pericliev and Valdes-Perez (in press) have used their Kinship program 
to generate minimal sets of features that distinguish kinship terms, like son and 
uncle, given genealogical and matrimonial relations that hold for each. They 
have applied their system to characterize kinship terms in both English and 
Bulgarian, and the results have found acceptance in anthropological linguistics 
because they are stated in that field’s conventional notation. 

There has also been extensive work in molecular biology, where one major 
goal is to predict the qualitative structure of proteins from their nucleotide se- 
quence, as Fayyad, Haussler, and Stolorz (1996) briefly review. This work has led 
to many publications in the biology and biochemistry literature, but we have cho- 
sen not to focus on it here. One reason is that most studies emphasize predictive 
accuracy, with low priority given to expressing the predictors in some common 
scientific notation. More important, many researchers have become concerned 
less with discovering new knowledge than with showing that their predictors 
give slight improvements in accuracy over other methods. A similar trend has
occurred in work on learning structure-activity relations in biochemistry, and we 
prefer not to label such efforts as computational scientific discovery. 

We also distinguish computer-aided scientific discovery from the equally chal- 
lenging, but quite different, use of machine learning to aid scientific data analysis. 
Fayyad et al. (1996) review some impressive examples of the latter approach in 
astronomy (classifying stars and galaxies in sky photographs), planetology (rec- 
ognizing volcanoes on Venus), and molecular biology (detecting genes in DNA 
sequences). But these efforts invoke induction primarily to automate tedious 
recognition tasks in support of cataloging and statistical analysis, rather than 
to discover new knowledge that holds scientific interest in its own right. Thus, 
we have not included them in our examples of computer-aided discovery. 

5 Progress and Prospects 

As the above examples show, work in computational scientific discovery no longer 
focuses solely on historical models, but also contributes novel knowledge to a 
range of scientific disciplines. To date, such applications remain the exception 
rather than the rule, but the breadth of successful computer-aided discoveries 
provides convincing evidence that these methods have great potential for aiding 
the scientific process. The clear influence of humans in each of these applications 
does not diminish the equally important contribution of the discovery system; 
each has a role to play in a complex and challenging endeavor. 

One recurring theme in applied discovery work has been the difficulty in find- 
ing collaborators from the relevant scientific field. Presumably, many scientists 
are satisfied with their existing methods and see little advantage to moving be- 
yond the statistical aids they currently use. This attitude seems less common in 
fields like molecular biology, which have taken the computational metaphor to
heart, but often there are social obstacles to overcome. The obvious response is
to emphasize that we do not intend our computational tools to replace scientists
but rather to aid them, just as simpler software already aids them in carrying 
out statistical analyses. 

However, making this argument convincing will require some changes in our 
systems to better reflect the position. As noted, existing discovery software al- 
ready supports intervention by humans in a variety of ways, from initial problem 
formulation to final interpretation. But in most cases this activity happens in 
spite of the software design rather than because the developer intended it. If 
we want to encourage synergy between human and artificial scientists, then we 
must modify our discovery systems to support their interaction more directly.
This means we must install interfaces with explicit hooks that let users state 
or revise their problem formulation and representational choices, manipulate 
the data and system parameters, and recast outputs in understandable terms. 
The Mechem and Daviccand systems already include such facilities and thus 
constitute good role models, but we need more efforts along these lines. 

Naturally, explicit inclusion of users in the computational discovery process 
raises a host of issues that are absent from the autonomous approach. These 
include questions about which decisions should be automated and which placed 
under human control, the granularity at which interaction should occur, and 
the type of interface that is best suited to a particular scientific domain. The 
discipline of human-computer interaction regularly addresses such matters, and 
though its lessons and design criteria have not yet been applied to computer- 
aided discovery, many of them should carry over directly from other domains. 
Interactive discovery systems also pose challenges in evaluation, since human 
variability makes experimentation more difficult than for autonomous systems. 
Yet experimental studies are not essential if one’s main goal is to develop com- 
putational tools that aid users in discovering new scientific knowledge. 

Clearly, we are only beginning to develop effective ways to combine the 
strengths of human cognition with those of computational discovery systems. But 
even our initial efforts have produced some convincing examples of computer- 
aided discovery that have led to publications in the scientific literature. We 
predict that, as more developers realize the need to provide explicit support for 
human intervention, we will see even more productive systems and even more 
impressive discoveries that advance the state of scientific knowledge. 
Abstract. We address the problem of computing various types of
expressive tests for decision trees and regression trees. Using expressive
tests is promising, because it may improve the prediction accuracy of 
trees. The drawback is that computing an optimal test could be costly. 
We present a unified framework to approach this problem, and we revisit 
the design of efficient algorithms for computing important special cases. 
We also prove that it is intractable to compute an optimal conjunction 
or disjunction. 



1 Introduction 

A decision (resp. regression) tree is a rooted binary tree structure for predicting 
the categorical (numeric) values of the objective attribute. Each internal node 
has a test on conditional attributes that splits the data into two classes. A record
is recursively tested at internal nodes and eventually reaches a leaf node. A 
good decision (resp. regression) tree has the property that almost all the records 
arriving at every node take a single categorical value (a numeric value close to 
the average) of the objective attribute with a high probability, and hence the 
single value (the average) could be a good predictor of the objective attribute. 

Making decision trees [9,11,10] and regression trees [2] has been a traditional 
research topic in the field of machine learning and artificial intelligence. Re- 
cently the efficient construction of decision trees and regression trees from large 
databases has been addressed and well studied among the KDD community. 
Computing tests at internal nodes is the most time-consuming step of
constructing decision trees and regression trees. The tests used in the literature
have been simple ones that check if the value of an attribute is equal to (or less
than) a specific value.

Using more expressive tests is promising in the sense that it may reduce the 
size of decision or regression trees while it can retain higher prediction accu- 
racy [3,8]. The drawback however is that the use of expressive tests could be 
costly. We consider the following three types of expressive tests for partitioning
data into two classes: 1) subsets of categorical values for categorical attributes,
2) ranges and regions for numeric attributes, and 3) conjunctions and
disjunctions of tests. We present a unified framework for handling those problems. We
then reconstruct efficient algorithms for the former two problems, and we prove 
the intractability of the third problem. 



S. Arikawa and H. Motoda (Eds.): DS’98, LNAI 1532, pp. 40-57, 1998. 
(c) Springer-Verlag Berlin Heidelberg 1998
2 Preliminaries

2.1 Relation Scheme, Attribute and Relation 

Let $\mathcal{R}$ denote a relation scheme, which is a set of categorical or numeric
attributes. The domain of a categorical attribute is a set of unordered distinct
values, while the domain of a numeric attribute is real numbers or integers.
We select a Boolean or numeric attribute $A$ as special and call it the objective
attribute. We call the other attributes in $\mathcal{R}$ conditional attributes.

Let $B$ be an attribute in relation scheme $\mathcal{R}$. Let $t$ denote a record (tuple)
over $\mathcal{R}$, and let $t[B]$ be the value for attribute $B$. A set of records over
$\mathcal{R}$ is called a relation over $\mathcal{R}$.

2.2 Tests on Conditional Attributes 

We will consider several types of tests for records in a database. Let $B$ denote
an attribute, and let $v$ and $v_i$ be values in the domain of $B$. $B = v$ is a simple
test, and $t$ meets $B = v$ if $t[B] = v$.

When $B$ is a categorical attribute, let $\{v_1, \ldots, v_k\}$ be a subset of values in the
domain of $B$. Then, $B \in \{v_1, \ldots, v_k\}$ is a test, and $t$ satisfies this test if $t[B]$ is
equal to one value in $\{v_1, \ldots, v_k\}$. We will call a test of the form
$B \in \{v_1, \ldots, v_k\}$ a test with a subset of categorical values.

When $B$ is a numeric attribute, $B = v$, $B \le v$, $B \ge v$, and $v_1 \le B \le v_2$
($B \in [v_1, v_2]$) are tests, and a record $t$ meets them respectively if $t[B] = v$,
$t[B] \le v$, $t[B] \ge v$, and $v_1 \le t[B] \le v_2$. We will call a test of the form
$B \in [v_1, v_2]$ a test with a range.

The negation of a test $T$ is denoted by $\neg T$. A record $t$ meets $\neg T$ if $t$ does
not satisfy $T$. The negation of $\neg T$ is $T$.

A conjunction (a disjunction, respectively) of tests $T_1, T_2, \ldots, T_k$ is of the
form $T_1 \wedge T_2 \wedge \ldots \wedge T_k$ ($T_1 \vee T_2 \vee \ldots \vee T_k$). A record $t$ meets a conjunction
(respectively, a disjunction) of tests, if $t$ satisfies all the tests (some of the tests).
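
The test types above compose naturally as predicates over records. The following is a minimal Python sketch, assuming a record is a dictionary from attribute names to values; the helper names (`simple`, `in_subset`, `in_range`, `negate`, `conj`, `disj`) are illustrative and do not come from the paper.

```python
from typing import Any, Callable, Dict, Iterable

Record = Dict[str, Any]
Test = Callable[[Record], bool]


def simple(attr: str, v: Any) -> Test:
    """Simple test B = v."""
    return lambda t: t[attr] == v


def in_subset(attr: str, values: Iterable[Any]) -> Test:
    """Test with a subset of categorical values: B in {v_1, ..., v_k}."""
    allowed = frozenset(values)
    return lambda t: t[attr] in allowed


def in_range(attr: str, lo: float, hi: float) -> Test:
    """Test with a range: v_1 <= B <= v_2."""
    return lambda t: lo <= t[attr] <= hi


def negate(test: Test) -> Test:
    """Negation: a record meets it exactly when it does not meet `test`."""
    return lambda t: not test(t)


def conj(*tests: Test) -> Test:
    """Conjunction: the record must satisfy all the tests."""
    return lambda t: all(T(t) for T in tests)


def disj(*tests: Test) -> Test:
    """Disjunction: the record must satisfy at least one test."""
    return lambda t: any(T(t) for T in tests)


record = {"color": "red", "age": 30}
both = conj(in_subset("color", {"red", "blue"}), in_range("age", 20, 40))
print(both(record))
```

Representing tests as closures keeps negation, conjunction, and disjunction closed under composition, which mirrors the definitions in this subsection.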

2.3 Splitting Criteria for Boolean Objective Attribute 

Splitting Relation in Two. Let $R$ be a set of records over $\mathcal{R}$, and let $|R|$ denote
the number of records in $R$. Let Test be a test on conditional attributes. Let $R_1$
be the set of records that meet Test, while let $R_2$ denote $R - R_1$. In this way, we
can use Test to divide $R$ into $R_1$ and $R_2$. Suppose that the objective attribute $A$
is Boolean. We call a record whose $A$'s value is true a positive record with respect
to the objective attribute $A$. Let $R^{t}$ denote the set of positive records in $R$. On
the other hand, we call a record whose $A$'s value is false a negative record, and
let $R^{f}$ denote the set of negative records in $R$. The following diagram illustrates
how $R$ is partitioned.

$$
\begin{array}{ccc}
 & R = R^{t} \cup R^{f} & \\
 \swarrow & & \searrow \\
R_1 = R_1^{t} \cup R_1^{f} & & R_2 = R_2^{t} \cup R_2^{f}
\end{array}
$$
The splitting by Test is effective for characterizing the objective Boolean
attribute $A$ if the probability of positive records changes dramatically after
the division of $R$ into $R_1$ and $R_2$; for instance, $|R^{t}|/|R| \ll |R_1^{t}|/|R_1|$, and
$|R^{t}|/|R| \gg |R_2^{t}|/|R_2|$. On the other hand, the splitting by Test is most
ineffective if the probability of positive records does not change at all; that is,

$$|R^{t}|/|R| = |R_1^{t}|/|R_1| = |R_2^{t}|/|R_2|.$$

Measuring the Effectiveness of Splitting. It is helpful to have a way of
measuring the effectiveness of the splitting by a condition. To define the measure, we
need to consider $|R|, |R^{t}|, |R^{f}|, |R_1|, |R_1^{t}|, |R_1^{f}|, |R_2|, |R_2^{t}|$ and $|R_2^{f}|$ as
parameters, which satisfy the following equations:

$$
\begin{array}{lll}
|R| = |R^{t}| + |R^{f}| & |R_1| = |R_1^{t}| + |R_1^{f}| & |R^{t}| = |R_1^{t}| + |R_2^{t}| \\
|R| = |R_1| + |R_2| & |R_2| = |R_2^{t}| + |R_2^{f}| & |R^{f}| = |R_1^{f}| + |R_2^{f}|
\end{array}
$$

Since $R$ is given and fixed, we can assume that $|R|$, $|R^{t}|$, and $|R^{f}|$ are constants.
Let $n$ and $m$ denote $|R|$ and $|R^{t}|$ respectively, then $|R^{f}| = n - m$. Furthermore,
if we give the values of $|R_1|$ and $|R_1^{t}|$, for instance, the values of all the other
variables are determined. Let $x$ and $y$ denote $|R_1|$ and $|R_1^{t}|$ respectively. Let
$\phi(x,y)$ denote the measurement of the effectiveness of the splitting by condition
Test. We now discuss some requirements that $\phi(x,y)$ is expected to have.

We first assume that a lower value of $\phi(x,y)$ indicates higher effectiveness of
the splitting. It does not matter if we select the reverse order. The splitting by
Test is most ineffective when $|R^{t}|/|R| = m/n = |R_1^{t}|/|R_1| = y/x = |R_2^{t}|/|R_2|$,
and hence $\phi(x,y)$ should be maximum when $y/x = m/n$.

Suppose that the probability of positive records in $R_1$, $y/x$, is greater than
that of positive records in $R$, $m/n$. Also suppose that if we divide $R$ by another
new test, the number of positive records in $R_1$ increases by $\Delta$ ($0 < \Delta \le x - y$),
while $|R_1|$ is the same. Then, the probability of positive records in $R_1$, $(y + \Delta)/x$,
becomes greater than $y/x$, and hence we want to claim that the splitting
by the new test is more effective. Thus we expect $\phi(x, y + \Delta) \le \phi(x, y)$. Similarly,
since $y/x < y/(x - \Delta)$ for $0 < \Delta \le x - y$, we also expect $\phi(x - \Delta, y) \le \phi(x, y)$.
Figure 1 illustrates points $(x,y)$, $(x, y + \Delta)$, and $(x - \Delta, y)$.

If the probability of positive records in $R_1$, $y/x$, is less than the average $m/n$,
then $(x, y)$ is in the lower side of the line connecting the origin and $(n, m)$. See
Figure 1. In this case observe that the probability of positive records in $R_2$, which
is $(m - y)/(n - x)$, is greater than $m/n$. Suppose that the number of positive
records in $R_1$ according to the new test decreases by $\Delta$ ($0 < \Delta \le x - y$), while
$|R_1|$ is unchanged. Then, the number of positive records in $R_2$ increases by $\Delta$
while $|R_2|$ is the same. Thus the splitting by the new test is more effective, and
we expect $\phi(x, y - \Delta) \le \phi(x, y)$. Similarly we also want to require
$\phi(x + \Delta, y) \le \phi(x, y)$.



Fig. 1. Points $(x, y)$, $(x, y + \Delta)$, $(x - \Delta, y)$, $(x, y - \Delta)$, and $(x + \Delta, y)$

Entropy of Splitting. We present an instance of $\phi(x,y)$ that meets all the
requirements discussed so far. Let $ent(p) = -p \ln p - (1 - p) \ln(1 - p)$, where $p$
means the probability of positive records in a set of records, while $(1 - p)$ implies
the probability of negative records. Define the entropy $Ent(x, y)$ of the splitting
by Test as follows:

$$Ent(x,y) = \frac{x}{n}\, ent\!\left(\frac{y}{x}\right) + \frac{n - x}{n}\, ent\!\left(\frac{m - y}{n - x}\right),$$

where $\frac{y}{x}$ ($\frac{m - y}{n - x}$, respectively) is the probability of positive records in $R_1$ ($R_2$).
This function is known as Quinlan's entropy heuristic [9], and it has been traditionally
used as a criterion for evaluating the effectiveness of the division of a set of records.
$Ent(x,y)$ is an instance of $\phi(x,y)$. We use the following theorem to show that
$Ent(x,y)$ satisfies all the requirements on $\phi(x,y)$.

Theorem 1. $Ent(x,y)$ is a concave function for $x \ge y \ge 0$; that is, for any
$(x_1, y_1)$ and $(x_2, y_2)$ in $\{(x,y) \mid x \ge y \ge 0\}$ and any $0 \le \lambda \le 1$,

$$\lambda\, Ent(x_1, y_1) + (1 - \lambda)\, Ent(x_2, y_2) \le Ent(\lambda (x_1, y_1) + (1 - \lambda)(x_2, y_2)).$$

$Ent(x,y)$ is maximum when $y/x = m/n$.

Proof See Appendix. □ 

We immediately obtain the following corollary. 

Corollary 1. Let $(x_3, y_3)$ be an arbitrary dividing point of $(x_1, y_1)$ and $(x_2, y_2)$
in $\{(x, y) \mid x \ge y \ge 0\}$. Then, $\min(Ent(x_1, y_1), Ent(x_2, y_2)) \le Ent(x_3, y_3)$.

For any $(x, y)$ such that $y/x > m/n$ and any $0 < \Delta \le x - y$, from the above
corollary we have

$$\min(Ent(x, y + \Delta), Ent(x, (m/n)x)) \le Ent(x, y),$$

because $(x, y)$ is a dividing point of $(x, y + \Delta)$ and $(x, (m/n)x)$. Since
$Ent(x, (m/n)x)$ is maximum,

$$Ent(x, y + \Delta) \le Ent(x, y),$$

and hence $Ent(x, y)$ satisfies the requirement $\phi(x, y + \Delta) \le \phi(x, y)$. In the same
way we can show that $Ent(x, y)$ meets all the requirements on $\phi(x, y)$.
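
The entropy of a split and the properties claimed for it can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names `ent` and `split_entropy` are ours, and `ent(0)` and `ent(1)` are taken to be 0 (the usual limiting convention), which the text does not spell out.

```python
import math


def ent(p: float) -> float:
    """ent(p) = -p ln p - (1 - p) ln(1 - p), with ent(0) = ent(1) = 0 assumed."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def split_entropy(x: float, y: float, n: float, m: float) -> float:
    """Ent(x, y): x records meet the test, y of them are positive,
    out of n records in total with m positives."""
    left = (x / n) * ent(y / x) if x > 0 else 0.0
    right = ((n - x) / n) * ent((m - y) / (n - x)) if x < n else 0.0
    return left + right


n, m = 100, 40
# Most ineffective split: y/x equals m/n, so the entropy is maximal.
print(split_entropy(50, 20, n, m))
# Moving positives into R_1 (larger y at the same x) lowers the entropy.
print(split_entropy(50, 30, n, m), split_entropy(50, 35, n, m))
```

Evaluating the function at a midpoint of two feasible points and comparing with the average of the endpoint values gives a quick spot check of the concavity stated in Theorem 1; it is of course no substitute for the proof.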



2.4 Splitting Criteria for Numeric Objective Attribute 

Consider the case when the objective attribute $A$ is numeric. Let $\mu(R)$ denote
the average of $A$'s values in relation $R$; that is, $\mu(R) = \frac{\sum_{t \in R} t[A]}{|R|}$. Let $R_1$
denote again the set of records that meet a test on conditional attributes, while
let $R_2$ denote $R - R_1$.

In order to characterize $A$, it is useful to find a test such that $\mu(R_1)$ is
considerably higher than $\mu(R)$ while $\mu(R_2)$ is substantially lower than $\mu(R)$
simultaneously. To realize this criterion, we use the interclass variance of the
splitting by the test:

$$|R_1|\,(\mu(R_1) - \mu(R))^2 + |R_2|\,(\mu(R_2) - \mu(R))^2.$$

A test is more interesting if the interclass variance of the splitting by the test is
larger. We also expect that the variance of $A$'s values in $R_1$ (resp., $R_2$) should
be small, which lets us approximate $A$'s values in $R_1$ ($R_2$) at $\mu(R_1)$ ($\mu(R_2)$). To
measure this property, we employ the intraclass variance of the splitting by the
test:

$$\frac{\sum_{t \in R_1} (t[A] - \mu(R_1))^2 + \sum_{t \in R_2} (t[A] - \mu(R_2))^2}{|R|}.$$

We are interested in a test that maximizes the interclass variance and also
minimizes the intraclass variance at the same time. Actually the maximization of the
interclass variance coincides with the minimization of the intraclass variance.

Theorem 2. Given a set of tests on conditional attributes, the test that maxi- 
mizes the interclass variance also minimizes the intraclass variance. 



Proof. See Appendix. □ 
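
One way to see why Theorem 2 is plausible is the standard sum-of-squares decomposition: the total squared deviation of $A$ from $\mu(R)$ equals the interclass variance plus $|R|$ times the intraclass variance, and the total does not depend on the test. The Python sketch below checks this identity on a toy split; the function names and the data are illustrative assumptions, and the appendix proof referenced above may proceed differently.

```python
from typing import List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def interclass_variance(r1: List[float], r2: List[float]) -> float:
    """|R_1| (mu(R_1) - mu(R))^2 + |R_2| (mu(R_2) - mu(R))^2."""
    mu = mean(r1 + r2)
    return len(r1) * (mean(r1) - mu) ** 2 + len(r2) * (mean(r2) - mu) ** 2


def intraclass_variance(r1: List[float], r2: List[float]) -> float:
    """Squared deviations from each part's own mean, divided by |R|."""
    mu1, mu2 = mean(r1), mean(r2)
    total = sum((a - mu1) ** 2 for a in r1) + sum((a - mu2) ** 2 for a in r2)
    return total / (len(r1) + len(r2))


def total_sum_of_squares(r: List[float]) -> float:
    mu = mean(r)
    return sum((a - mu) ** 2 for a in r)


# Objective-attribute values of the records meeting / not meeting a test.
r1, r2 = [10.0, 11.0, 12.0], [1.0, 2.0, 3.0]
lhs = total_sum_of_squares(r1 + r2)
rhs = interclass_variance(r1, r2) + len(r1 + r2) * intraclass_variance(r1, r2)
print(lhs, rhs)
```

Because the left-hand side is fixed once $R$ is fixed, any test that raises the interclass variance must lower the intraclass variance by the corresponding amount.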

In what follows, we will focus on the maximization of the interclass variance.
When $R$ is given and fixed, $|R|$ ($= |R_1| + |R_2|$) and $\sum_{t \in R} t[A]$ are regarded
as constants, and let $n$ and $m$ denote $|R|$ and $\sum_{t \in R} t[A]$ respectively. If we
denote $|R_1|$ and $\sum_{t \in R_1} t[A]$ by $x$ and $y$, the interclass variance is determined
by $x$ and $y$ as follows:

$$x \left(\frac{y}{x} - \frac{m}{n}\right)^2 + (n - x) \left(\frac{m - y}{n - x} - \frac{m}{n}\right)^2,$$

which will be denoted by $Var(x,y)$. We then have the following property of
$Var(x,y)$, which is similar to Theorem 1 for the entropy function.

Theorem 3. $Var(x,y)$ is a convex function for $0 < x < n$; that is, for any
$(x_1, y_1)$ and $(x_2, y_2)$ such that $n > x_1, x_2 > 0$ and any $0 \le \lambda \le 1$,

$$\lambda\, Var(x_1, y_1) + (1 - \lambda)\, Var(x_2, y_2) \ge Var(\lambda (x_1, y_1) + (1 - \lambda)(x_2, y_2)).$$

$Var(x,y)$ is minimum when $y/x = m/n$.

Proof. See Appendix. □

Corollary 2. If $(x_3, y_3)$ is an arbitrary dividing point of $(x_1, y_1)$ and $(x_2, y_2)$
such that $n > x_1, x_2 > 0$, then $\max(Var(x_1, y_1), Var(x_2, y_2)) \ge Var(x_3, y_3)$.

Since the interclass variance has the property similar to the entropy function, 
in the following sections, we will present how to compute the optimal test that 
minimizes the entropy, but all arguments directly carry over to the case of finding 
the test maximizing the interclass variance. 
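
As a small numerical illustration of the closed form, the sketch below evaluates the interclass variance as a function of $(x, y)$ and spot-checks the two properties stated in Theorem 3. The function name `var_xy` and the sample numbers are assumptions made for this example only.

```python
def var_xy(x: float, y: float, n: float, m: float) -> float:
    """Interclass variance Var(x, y) for 0 < x < n, where x = |R_1|,
    y is the sum of A over R_1, n = |R| and m is the sum of A over R."""
    overall = m / n
    return x * (y / x - overall) ** 2 + (n - x) * ((m - y) / (n - x) - overall) ** 2


n, m = 6, 39.0
# Split with sums 33 (three records) and 6 (three records).
print(var_xy(3, 33.0, n, m))
# When y/x = m/n the two parts have the same mean and the variance vanishes.
print(var_xy(3, 19.5, n, m))
# Convexity spot check at the midpoint of (2, 10) and (4, 30).
average = 0.5 * var_xy(2, 10.0, n, m) + 0.5 * var_xy(4, 30.0, n, m)
print(average >= var_xy(3, 20.0, n, m))
```

Since maximizing a convex function and minimizing a concave one are both attained at extreme points, the same geometric arguments serve for entropy and for interclass variance.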



2.5 Positive Tests and Negative Tests 

Let $R$ be a given relation, and let $R_1$ be the set of records in $R$ that meet a
given test. If the objective attribute $A$ is Boolean, we treat "true" and "false" as
numbers "1" and "0" respectively. We call the test positive if the average of $A$'s
values in $R_1$ is greater than or equal to the average of $A$'s values in $R$; that
is, $\left(\sum_{t \in R_1} t[A]\right)/|R_1| \ge \left(\sum_{t \in R} t[A]\right)/|R|$. Otherwise the test is called negative.

Thus, when $A$ is Boolean and the test is positive, the probability of positive
records in $R_1$ is greater than or equal to the probability of positive records in $R$.

The test that minimizes the entropy could be either positive or negative. In 
what follows, we will focus on computing the positive test that minimizes the 
entropy of the splitting by the positive test among all the positive tests. This 
is because the algorithm for computing the optimal positive test can be used 
to calculate the optimal negative test by exchanging “true” and “false” of the 
objective Boolean attribute value (or reversing the order of the objective numeric 
attribute value) in each record. 

3 Computing Optimal Tests with Subsets of Categorical 
Values 

Let $C$ be a conditional categorical attribute, and let $\{c_1, c_2, \ldots, c_k\}$ be the
domain of $C$. Among all the positive tests of the form $C \in S$ where $S$ is a subset
of $\{c_1, c_2, \ldots, c_k\}$, we want to compute the positive test that minimizes the
entropy of the splitting. A naive solution would consider all the possible subsets of
$\{c_1, c_2, \ldots, c_k\}$ and select the one that minimizes the entropy. Instead of
investigating all $2^k$ subsets, there is an efficient way of checking only $k$ subsets.

We first treat "true" and "false" as real numbers "1" and "0" respectively.
For each $c_i$, let $\mu_i$ denote the average of $A$'s values of all the records whose $C$'s
values are $c_i$; that is,

$$\mu_i = \frac{\sum_{t[C] = c_i} t[A]}{|\{t \mid t[C] = c_i\}|}.$$

Without loss of generality we can assume that $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_k$, otherwise
we rename the categorical values appropriately to meet the above property. We
then have the following theorem.
Theorem 4. Among all the positive tests with subsets of categorical values,
there exists a positive test of the form $C \in \{c_i \mid 1 \le i \le j\}$ that minimizes
the entropy of the splitting.

This theorem is due to Breiman et al. [2]. Thanks to this theorem, we only need
to consider $k$ tests of the form $C \in \{c_i \mid 1 \le i \le j\}$ to find the optimal test. We
now prove the theorem by using techniques introduced in the previous section.
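
The practical consequence of Theorem 4 can be sketched as follows: sort the categorical values by their average objective value and evaluate only the $k$ prefixes. The Python code below is a hedged illustration for a Boolean objective attribute, assuming each categorical value comes with a non-zero record count; the names `best_prefix_subset` and `brute_force_subset` and the input format are ours, and the exhaustive search is included only to cross-check the prefix scan on a small example.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, Tuple

Stats = Dict[str, Tuple[int, int]]  # value -> (record count, positive count)


def ent(p: float) -> float:
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def split_entropy(x: int, y: int, n: int, m: int) -> float:
    left = (x / n) * ent(y / x) if x > 0 else 0.0
    right = ((n - x) / n) * ent((m - y) / (n - x)) if x < n else 0.0
    return left + right


def best_prefix_subset(stats: Stats) -> Tuple[float, FrozenSet[str]]:
    """Scan the k prefixes of the values sorted by decreasing average."""
    n = sum(c for c, _ in stats.values())
    m = sum(p for _, p in stats.values())
    order = sorted(stats, key=lambda c: stats[c][1] / stats[c][0], reverse=True)
    best: Tuple[float, FrozenSet[str]] = (float("inf"), frozenset())
    x = y = 0
    chosen = []
    for c in order:
        x += stats[c][0]
        y += stats[c][1]
        chosen.append(c)
        e = split_entropy(x, y, n, m)
        if e < best[0]:
            best = (e, frozenset(chosen))
    return best


def brute_force_subset(stats: Stats) -> Tuple[float, FrozenSet[str]]:
    """Exhaustive search over all non-empty subsets that give positive tests."""
    n = sum(c for c, _ in stats.values())
    m = sum(p for _, p in stats.values())
    best: Tuple[float, FrozenSet[str]] = (float("inf"), frozenset())
    values = list(stats)
    for r in range(1, len(values) + 1):
        for subset in combinations(values, r):
            x = sum(stats[c][0] for c in subset)
            y = sum(stats[c][1] for c in subset)
            if y * n >= m * x:  # positive test: y/x >= m/n
                e = split_entropy(x, y, n, m)
                if e < best[0]:
                    best = (e, frozenset(subset))
    return best


stats = {"a": (10, 9), "b": (10, 2), "c": (10, 7), "d": (10, 1)}
print(best_prefix_subset(stats))
print(brute_force_subset(stats))
```

The prefix scan needs one sort plus $k$ entropy evaluations, whereas the exhaustive search grows as $2^k$.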

Proof of Theorem 4 We will prove the case of the minimization of the entropy. 
The case of maximization of the interclass variance can be shown similarly. 

Since $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_k$, for $1 \le j \le k$, $C \in \{c_i \mid 1 \le i \le j\}$ is a positive
test. Assume that among positive tests with subsets of categorical values, there
does not exist any $1 \le j \le k$ such that $C \in \{c_i \mid 1 \le i \le j\}$ minimizes the
entropy, which we will contradict in what follows. Then, there exists $1 \le h \le k$
such that test $C \in \{c_i \mid 1 \le i \le h - 1\} \cup V$, where $V$ is a non-empty subset of
$\{c_i \mid h < i \le k\}$, is positive and minimizes the entropy. $\{c_i \mid 1 \le i \le h - 1\} \cup V$
contains consecutive values from $c_1$ to $c_{h-1}$ but lacks $c_h$. Let $h$ be the minimum
number that satisfies this property.

With each subset $W$ of $\{c_1, c_2, \ldots, c_k\}$, we associate

$$p(W) = \Bigl(\, |\{t \mid t[C] \in W\}|,\ \sum_{t[C] \in W} t[A] \Bigr)$$

in the Euclidean plane. $Ent(p(W))$ is the entropy of the splitting by the test
$C \in W$. The slope of the line between the origin and $p(W)$ gives the average
value of the objective attribute $A$ among records in $\{t \mid t[C] \in W\}$. We will
denote the average by $\mu_W$; namely,

$$\mu_W = \frac{\sum_{t[C] \in W} t[A]}{|\{t \mid t[C] \in W\}|}.$$

Let $\{c_1, \ldots, c_{h-1}\}$ denote $\{c_i \mid 1 \le i \le h - 1\}$. Figure 2 shows $p(\{c_1, \ldots, c_{h-1}\})$,
$p(\{c_1, \ldots, c_{h-1}\} \cup V)$, and $p(\{c_1, \ldots, c_{h-1}, c_h\} \cup V)$. Since $p(\{c_1, \ldots, c_{h-1}\} \cup V)$
is associated with a positive test, it lies in the upper side of the line between
the origin and $p(\{c_1, c_2, \ldots, c_k\})$. $\mu_{\{c_1, \ldots, c_{h-1}\}} \ge \mu_{\{c_h\}} \ge \mu_V$, because
$\mu_1 \ge \mu_2 \ge \cdots \ge \mu_k$, and $V$ is a subset of $\{c_i \mid h < i \le k\}$. It is easy to see that
$p(\{c_1, \ldots, c_{h-1}\})$ and $p(\{c_1, \ldots, c_{h-1}, c_h\} \cup V)$ are also in the upper side of the
line between the origin and $p(\{c_1, c_2, \ldots, c_k\})$.

Suppose that $p(\{c_1, \ldots, c_{h-1}, c_h\} \cup V)$ is in the upper side of the line passing
through the origin and $p(\{c_1, \ldots, c_{h-1}\} \cup V)$. The left in Figure 3 shows this
situation. Suppose that the line passing through $p(\{c_1, \ldots, c_{h-1}, c_h\} \cup V)$ and
$p(\{c_1, \ldots, c_{h-1}\} \cup V)$ hits the line between the origin and $p(\{c_1, \ldots, c_k\})$ at $Q$.
From the concavity of the entropy function

$$\min(Ent(p(\{c_1, \ldots, c_{h-1}, c_h\} \cup V)), Ent(Q)) \le Ent(p(\{c_1, \ldots, c_{h-1}\} \cup V)).$$

Since $Ent(Q)$ is maximum,

$$Ent(p(\{c_1, \ldots, c_{h-1}, c_h\} \cup V)) \le Ent(p(\{c_1, \ldots, c_{h-1}\} \cup V)),$$

which contradicts the choice of $h$.

Fig. 2. $p(\{c_1, \ldots, c_{h-1}\})$, $p(\{c_1, \ldots, c_{h-1}\} \cup V)$, and $p(\{c_1, \ldots, c_{h-1}, c_h\} \cup V)$

Fig. 3. The left (resp. right) figure illustrates the case when $p(\{c_1, \ldots, c_{h-1}, c_h\} \cup V)$
is in the upper (lower) side of the line connecting the origin and $p(\{c_1, \ldots, c_{h-1}\} \cup V)$.

We now consider the opposite case when $p(\{c_1, \ldots, c_{h-1}, c_h\} \cup V)$ is in the
lower side of the line connecting the origin and $p(\{c_1, \ldots, c_{h-1}\} \cup V)$. See the
right in Figure 3. Suppose that the line passing through $p(\{c_1, \ldots, c_{h-1}, c_h\} \cup V)$
and $p(\{c_1, \ldots, c_{h-1}\} \cup V)$ hits the line between the origin and $p(\{c_1, \ldots, c_{h-1}\})$
at $Q$. If $Ent(p(\{c_1, \ldots, c_{h-1}\} \cup V)) \le Ent(Q)$, from the concavity of the entropy
function, we have

$$Ent(p(\{c_1, \ldots, c_{h-1}, c_h\} \cup V)) \le Ent(p(\{c_1, \ldots, c_{h-1}\} \cup V)),$$

which contradicts the choice of $h$. If $Ent(Q) < Ent(p(\{c_1, \ldots, c_{h-1}\} \cup V))$, we
have

$$Ent(p(\{c_1, \ldots, c_{h-1}\})) \le Ent(Q),$$

because $Ent(0, 0)$ is maximum, and $Ent(x, y)$ is a concave function. Thus we
have

$$Ent(p(\{c_1, \ldots, c_{h-1}\})) < Ent(p(\{c_1, \ldots, c_{h-1}\} \cup V)),$$

which again contradicts the choice of $h$. □
Fig. 4. An x-monotone region (left) and a rectilinear convex region (right)



4 Computing Optimal Tests with Ranges or Regions 

Let $B$ be a conditional attribute that is numeric, and let $I$ be a range of the
domain of $B$. We are interested in finding a test of the form $B \in I$ that minimizes
the entropy (or maximizes the interclass variance) of the splitting by the test.
When the domain of $B$ is real numbers, the number of candidates could be
infinite. One way to cope with this problem is that we discretize this problem
by dividing the domain of $B$ into disjoint sub-ranges, say $I_1, \ldots, I_N$, so that
the union $I_1 \cup \ldots \cup I_N$ is the domain of $B$. The division of the domain, for
instance, can be done by distributing the values of $B$ in the given set of records
into equal-sized sub-ranges. We then concatenate some successive sub-ranges,
say $I_i, I_{i+1}, \ldots, I_j$, to create a range $I_i \cup I_{i+1} \cup \ldots \cup I_j$ that optimizes the criterion
of interest.
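
The discretization step can be sketched as follows, under the assumption that "equal-sized" means each sub-range receives roughly the same number of records (equi-depth bucketing); if equal-width intervals were intended, the boundaries would be computed differently. The function names `equi_depth_boundaries` and `bucket_index` are illustrative.

```python
import bisect
from typing import List, Sequence


def equi_depth_boundaries(values: Sequence[float], num_buckets: int) -> List[float]:
    """Return num_buckets - 1 cut points so that each sub-range holds
    roughly the same number of the given values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]


def bucket_index(v: float, boundaries: Sequence[float]) -> int:
    """Index of the sub-range I_1, ..., I_N (0-based here) that contains v."""
    return bisect.bisect_right(boundaries, v)


values = list(range(100))
cuts = equi_depth_boundaries(values, 4)
counts = [0] * 4
for v in values:
    counts[bucket_index(v, cuts)] += 1
print(cuts, counts)
```

Once every record is mapped to a sub-range index, per-bucket record counts and objective sums are all that later steps need.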

It is natural to consider the two-dimensional version. Let $B$ and $C$ be numeric
conditional attributes. We also simplify this problem by dividing the domain
of $B$ (resp. $C$) into $N_B$ ($N_C$) equal-sized sub-ranges. We assume that
$N_B = N_C = N$ without loss of generality as regards our algorithms. We then
divide the Euclidean plane associated with $B$ and $C$ into $N \times N$ pixels. A grid
region is a set of pixels, and let $R$ be an instance. A record $t$ satisfies test
$(B, C) \in R$ if $(t[B], t[C])$ belongs to $R$. We can consider various types of grid
regions for the purpose of splitting a relation in two. In the literature two classes
of regions have been well studied [4,3,12,8]. An x-monotone region is a connected
grid region whose intersection with any vertical line is undivided. A rectilinear
convex region is an x-monotone region whose intersection with any horizontal
line is also undivided. Figure 4 shows an x-monotone region in the left and a
rectilinear convex region in the right.
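
The two region classes can be made concrete with a small membership check. The sketch below assumes a grid region is given as a set of `(column, row)` pixel coordinates and that "connected" means connected through shared pixel edges (4-connectivity); both are assumptions for illustration, as are the function names.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]  # (column, row)


def _lines_undivided(region: Set[Pixel], axis: int) -> bool:
    """True if, for every line with fixed coordinate `axis`, the pixels
    of the region on that line form one contiguous run."""
    lines: Dict[int, List[int]] = {}
    for px in region:
        lines.setdefault(px[axis], []).append(px[1 - axis])
    return all(max(v) - min(v) + 1 == len(v) for v in lines.values())


def is_connected(region: Set[Pixel]) -> bool:
    if not region:
        return False
    start = next(iter(region))
    seen = {start}
    queue = deque([start])
    while queue:
        cx, cy = queue.popleft()
        for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
            if nb in region and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == len(region)


def is_x_monotone(region: Set[Pixel]) -> bool:
    """Connected, and every vertical line meets the region in one run."""
    return is_connected(region) and _lines_undivided(region, 0)


def is_rectilinear_convex(region: Set[Pixel]) -> bool:
    """x-monotone, and every horizontal line also meets it in one run."""
    return is_x_monotone(region) and _lines_undivided(region, 1)


u_shape = {(0, 0), (0, 1), (0, 2), (1, 0), (2, 0), (2, 1), (2, 2)}
plus_shape = {(1, 0), (0, 1), (1, 1), (2, 1), (1, 2)}
print(is_x_monotone(u_shape), is_rectilinear_convex(u_shape))
print(is_x_monotone(plus_shape), is_rectilinear_convex(plus_shape))
```

A U-shaped region is x-monotone but not rectilinear convex, because a horizontal line through its two arms is divided; a plus-shaped region satisfies both definitions.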

In the case of computing the optimal range by concatenating some consecutive
sub-ranges of $N$ sub-ranges, we may consider $O(N^2)$ sequences of successive
sub-ranges, but to this end, Katoh [6] presents an $O(N \log N)$-time algorithm.
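
For reference, the naive $O(N^2)$ enumeration mentioned above can be sketched as below for a Boolean objective attribute. This is only the baseline, not the $O(N \log N)$ algorithm of [6]; the function name `best_range` and the per-bucket input format (record counts and positive counts for $I_1, \ldots, I_N$) are assumptions for illustration.

```python
import math
from typing import List, Optional, Tuple


def ent(p: float) -> float:
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def split_entropy(x: int, y: int, n: int, m: int) -> float:
    left = (x / n) * ent(y / x) if x > 0 else 0.0
    right = ((n - x) / n) * ent((m - y) / (n - x)) if x < n else 0.0
    return left + right


def best_range(
    counts: List[int], positives: List[int]
) -> Tuple[float, Optional[Tuple[int, int]]]:
    """Try every run of consecutive sub-ranges i..j (0-based, inclusive)
    and return the minimum entropy together with the run attaining it."""
    n, m = sum(counts), sum(positives)
    best: Tuple[float, Optional[Tuple[int, int]]] = (float("inf"), None)
    for i in range(len(counts)):
        x = y = 0  # running totals for the run starting at i
        for j in range(i, len(counts)):
            x += counts[j]
            y += positives[j]
            if x == 0:
                continue  # an empty range does not split anything
            e = split_entropy(x, y, n, m)
            if e < best[0]:
                best = (e, (i, j))
    return best


counts = [10, 10, 10, 10, 10]
positives = [1, 2, 9, 8, 1]
print(best_range(counts, positives))
```

Accumulating `x` and `y` incrementally keeps each of the $O(N^2)$ candidate ranges at constant cost.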

On the other hand, the number of x-monotone regions and the number of
rectilinear convex regions are both more than $2^N$. It is non-trivial to efficiently find
such a region $R$ that minimizes the entropy (maximizes the interclass variance)
of the splitting by the test $(B, C) \in R$. Here we review some techniques for this
purpose.

Fig. 5. The left figure presents the convex hull of stamp points. The middle
illustrates $P$, $Q_1$, $Q_2$ and $Q_3$ in Proposition 1. The right shows the hand-probing
technique.

Convex Hull of Stamp Points Let $\mathcal{R}$ denote the family of x-monotone regions
or the family of rectilinear convex regions. Let $A$ be the objective attribute.
When $A$ is Boolean, we treat “true” and “false” as the real numbers “1”
and “0”. With each region $R$ in $\mathcal{R}$, we associate a stamp point $(x, y)$ where
$x = |\{t \mid t \text{ meets } (B, C) \in R\}|$ and $y = \sum_{\{t \mid t \text{ meets } (B, C) \in R\}} t[A]$. Since the number
of regions in $\mathcal{R}$ is more than $2^N$, we cannot afford to calculate all the associated
points; hence we simply assume their existence.

Let $S$ denote the set of stamp points for a family of regions $\mathcal{R}$. A convex
polygon of $S$ has the property that any line segment connecting two arbitrary points of $S$
lies entirely inside the polygon. The convex hull of $S$ is the smallest
convex polygon of $S$. The left of Figure 5 illustrates the convex hull. The upper
(lower) half of a convex hull is called the upper (lower) hull, in short.

Proposition 1. Let $R \in \mathcal{R}$ be the region such that the test $(B, C) \in R$ minimizes
the entropy (or maximizes the interclass variance). The stamp point associated
with $R$ must be on the convex hull of $S$.

Proof. Otherwise there exists a point $P$ inside the convex hull of $S$ that
minimizes the entropy. Select any point $Q_1$ on the convex hull, draw the line
connecting $P$ and $Q_1$, and let $Q_2$ be the other point where the line through $P$
and $Q_1$ crosses the convex hull. From the concavity of the entropy function,
$\min(Ent(Q_1), Ent(Q_2)) \le Ent(P)$; without loss of generality assume that the
minimum is attained at $Q_2$. Again by concavity, there exists a stamp point $Q_3$ on the convex
hull such that $Ent(Q_3) \le Ent(Q_2)$ (see Figure 5). Thus, $Ent(Q_3) \le Ent(P)$,
which is a contradiction. □

If $T$ is the positive (negative, resp.) test that minimizes the entropy among
all the positive (negative) tests of the form $(B, C) \in R$, from Proposition 1 the stamp point
associated with $T$ must be on the upper (lower) hull. We now present how to
scan the upper hull to search for the stamp point that minimizes the entropy.

Hand-Probing To this end it is useful to use the “hand-probing” technique that
was invented by Asano, Chen, Katoh and Tokuyama [1] for image segmentation
and was later modified by Fukuda, Morimoto, Morishita and Tokuyama [4] for
extraction of the optimal x-monotone region.

For each stamp point on the upper hull, there exists a tangent line to the
upper hull at the point. Let $\theta$ denote the slope of the tangent line. The right
picture in Figure 5 shows the tangent line. Note that the stamp point maximizes
$y - \theta x$ among all the stamp points, and let $R$ denote the region that corresponds
to the stamp point. We now present a roadmap of how to construct $R$.

Let $p_{i,j}$ ($1 \le i, j \le N$) denote the $(i, j)$-th pixel in the $N \times N$ pixels. A grid region
is a union of pixels. Let $u_{i,j}$ be the number of records that meet $(B, C) \in p_{i,j}$,
and let $v_{i,j}$ be the sum of the objective attribute values of all the records that
satisfy $(B, C) \in p_{i,j}$, which is $\sum_{\{t \mid t \text{ meets } (B, C) \in p_{i,j}\}} t[A]$. Using those notations, we
can represent the stamp point associated with $R$ by
$\left(\sum_{p_{i,j} \subseteq R} u_{i,j},\ \sum_{p_{i,j} \subseteq R} v_{i,j}\right)$,
which maximizes $y - \theta x$. Since

$$\sum_{p_{i,j} \subseteq R} v_{i,j} - \theta \sum_{p_{i,j} \subseteq R} u_{i,j} = \sum_{p_{i,j} \subseteq R} (v_{i,j} - \theta u_{i,j}),$$

$R$ maximizes $\sum_{p_{i,j} \subseteq R} (v_{i,j} - \theta u_{i,j})$.

We call $v_{i,j} - \theta u_{i,j}$ the gain of the pixel $p_{i,j}$. The problem of computing
the region that maximizes the sum of gains of pixels in the region has been
studied. For a family of x-monotone regions, Fukuda, Morimoto, Morishita, and
Tokuyama present an $O(N^2)$-time algorithm [4]. For a family of rectilinear
convex regions, Yoda, Fukuda, Morimoto, Morishita, and Tokuyama give an
$O(N^3)$-time algorithm [12]. Due to the space limitation, we do not introduce
those algorithms. Those algorithms use the idea of dynamic programming, and
they connect an interval in each column from lower index $i$ to higher one to
generate an x-monotone (or rectilinear convex) region.
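
To make the column-by-column dynamic programming concrete, here is a deliberately naive Python sketch for x-monotone regions. It is not the algorithm of [4]: it runs in roughly $O(N^5)$ time on an $N \times N$ grid and only returns the maximum total gain, not the region itself. The function name and the grid layout (`u[i][j]`, `v[i][j]` with `i` as the column index) are assumptions made for illustration.

```python
def max_gain_x_monotone(u, v, theta):
    """u[i][j], v[i][j]: record count and objective-value sum of pixel (i, j),
    where i indexes columns (attribute B) and j indexes rows (attribute C).
    Returns the maximum of sum(v - theta * u) over nonempty x-monotone regions."""
    n_cols, n_rows = len(u), len(u[0])
    best_overall = float("-inf")
    prev = {}  # (s, t) -> best gain of a region whose last column uses rows s..t
    for i in range(n_cols):
        # Prefix sums of the pixel gains in column i.
        pre = [0.0]
        for j in range(n_rows):
            pre.append(pre[-1] + v[i][j] - theta * u[i][j])
        cur = {}
        for s in range(n_rows):
            for t in range(s, n_rows):
                col_gain = pre[t + 1] - pre[s]
                # Best region ending in the previous column whose interval
                # overlaps rows s..t, so that the union stays connected.
                ext = max((g for (s2, t2), g in prev.items()
                           if s2 <= t and t2 >= s), default=0.0)
                # Either extend that region or start a new one in this column.
                cur[(s, t)] = col_gain + max(0.0, ext)
                best_overall = max(best_overall, cur[(s, t)])
        prev = cur
    return best_overall
```

Each hand-probing trial with slope $\theta$ corresponds to one call of such a routine; the cited algorithms obtain the same optimum far more efficiently.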

Since we have an efficient algorithm for generating the region associated with
the stamp point on the convex hull at which the line with slope $\theta$ touches, it
remains to answer how many trials of the hand-probing procedure are necessary to
find the region that minimizes the entropy. If $n$ is the number of given records,
there could be at most $n$ stamp points on the upper hull, and therefore we may
have to do $n$ trials of hand-probing by using $n$ distinct slopes. Next we present
a technique that is expected to reduce the number of trials to $O(\log n)$ in
practice.



Guided Branch-and-Bound Search Using a tangent line with the slope
$\theta = 0$, we can touch the rightmost point on the convex hull. Let $a$ be an arbitrarily
large real number such that we can touch the leftmost point on the convex hull
by using the tangent line with slope $a$. Thus using slopes in $[0, a]$, we can scan
all the points on the upper hull. We then perform the binary search on $[0, a]$ to
scan the convex hull.

During the process we may dramatically reduce the search space. Figure 6
shows the case when we use two tangent lines to touch two points $P$ and $Q$ on
the convex hull, and $R$ denotes the point of intersection of the two lines. Let $X$
be an arbitrary point inside the triangle $PQR$. From the concavity of the entropy
function, we immediately obtain

$$\min\{Ent(P), Ent(Q), Ent(R)\} \le Ent(X).$$

If $\min\{Ent(P), Ent(Q)\} \le Ent(R)$, we have $\min\{Ent(P), Ent(Q)\} \le Ent(X)$,
which implies that it is useless to check whether or not there exists a point
between $P$ and $Q$ on the hull whose entropy is less than $\min\{Ent(P), Ent(Q)\}$.
In practice, most of the subintervals of slopes are expected to be pruned away
during the binary search. This guided branch-and-bound search strategy has been
experimentally evaluated [3,8]. According to the experimental tests, the number of
trials of the hand-probing procedure is $O(\log n)$.

Fig. 6. Guided Branch-and-Bound Search

5 Computing Optimal Conjunctions and Disjunctions 

Suppose that we are given a set $S$ of tests on conditional attributes. We also
assume that $S$ contains the negation of an arbitrary test in $S$. We call a
conjunction positive (negative, resp.) if it is a positive (negative) test. We will show
that it is NP-hard to compute the positive conjunction (the positive disjunction,
resp.) that minimizes the entropy among all positive conjunctions (positive
disjunctions) of tests in $S$. Also, it is NP-hard to compute the positive conjunction
(positive disjunction) that maximizes the interclass variance.

Let $T_1 \wedge \ldots \wedge T_k$ be a positive conjunction of tests in $S$. Observe that the
entropy (the interclass variance, resp.) of the splitting by $T_1 \wedge \ldots \wedge T_k$ is equal to
the entropy (the interclass variance) of the splitting by $\neg(T_1 \wedge \ldots \wedge T_k)$.
$\neg(T_1 \wedge \ldots \wedge T_k)$ is equivalent to $\neg T_1 \vee \ldots \vee \neg T_k$, which is a negative disjunction of tests
in $S$. Thus the negation of the optimal positive conjunction gives the optimal
negative disjunction. As remarked in Subsection 2.5, computing a negative test
can be done by using a way of computing a positive test, and therefore we will
prove the intractability of computing the optimal positive disjunction.

Theorem 5. Given a set $S$ of tests on conditional attributes such that $S$
contains the negation of any test in $S$, it is NP-hard to compute the positive
disjunction of tests in $S$ that minimizes the entropy value among all positive
disjunctions. It is also NP-hard to compute the positive disjunction that maximizes
the interclass variance.

Fig. 7. Each subset is extended with a unique white element.

Proof. Here we present a proof for the case of the entropy. The case of the
interclass variance can be proved in a similar manner. We establish the difficulty
of the problem by a reduction from MINIMUM COVER [5], which is NP-hard. Let $V$ be a finite
set, and let $\mathcal{C}$ be a collection of subsets of $V$. A sub-collection $\mathcal{C}'$ ($\subseteq \mathcal{C}$) is a cover
of $V$ if every element in $V$ belongs to one of the subsets in $\mathcal{C}'$. Suppose that $\mathcal{C}_{min}$ is a cover that
minimizes the number of subsets in it. It is NP-hard to compute $\mathcal{C}_{min}$.

Suppose that $V$ contains $a$ elements, and $\mathcal{C}$ contains $c$ subsets of $V$. We call
elements in $V$ black. Let $b$ be a number greater than $a$ and $c$, generate a set $W$ of
$b$ new elements, and call them white. We then extend each subset in $\mathcal{C}$ by adding
a unique white element that does not appear elsewhere. Since $b > c$, $b - c$ white
elements are not used for this extension. Figure 7 illustrates this operation. In
the figure each hyperedge shows a subset in $\mathcal{C}$. After this extension, $\mathcal{C}$ and $\mathcal{C}_{min}$
become collections of subsets of $V \cup W$.

In what follows, we treat elements in $V \cup W$ as records in a database. We
assume that the objective attribute is true (false, resp.) for black (white) records
in $V \cup W$. We then identify each subset in $\mathcal{C}$ with a test such that all elements
in the subset meet the test, while none of the elements outside the subset satisfy
the test. We also identify a collection $\mathcal{C}'$ ($\subseteq \mathcal{C}$) with the disjunction of tests that
correspond to subsets in $\mathcal{C}'$. We then show that the disjunction corresponding
to $\mathcal{C}_{min}$ minimizes the entropy, which means that finding the optimum
disjunction is NP-hard.

With each sub-collection $\mathcal{C}'$ ($\subseteq \mathcal{C}$) such that the disjunction identified with $\mathcal{C}'$
is positive, we associate a point $(x, y)$ in the Euclidean plane such that $x$ is the
number of records in $\mathcal{C}'$, and $y$ is the number of black records in $\mathcal{C}'$. See Figure 8.
$Ent(x, y)$ gives the entropy of the disjunction identified with $\mathcal{C}'$. Let $k$ denote the
number of subsets in the minimum cover $\mathcal{C}_{min}$. The point $(a + k, a)$ is associated with $\mathcal{C}_{min}$.
We prove that all the points associated with collections of subsets of $\mathcal{C}$ fall in
the gray region in Figure 8.

Fig. 8. Points Associated with Sub-collections

All the points lie on the upper side of, or on, the line connecting the origin and
$(a + b, a)$, because each point corresponds to a positive disjunction. We show that
all the points lie under or on the line between $(a + k, a)$ and $(a - k, a - k)$. To
this end, it is enough to prove that any $\mathcal{C}' \subseteq \mathcal{C}$ that contains $a - l$ ($l \ge 1$) black
records must also have at least $k - l$ white records. The proof is by induction
on $l$. Consider the case when $l = 1$. Suppose that the number of white records
is less than $k - 1$. We can immediately construct a cover of $V$ by adding to $\mathcal{C}'$
a subset $X$ that contains the remaining black record. Note that the number of
white records in $\mathcal{C}' \cup \{X\}$ is less than $k$, which contradicts the choice of $\mathcal{C}_{min}$.
The argument carries over to the case when $l > 1$.

We then prove that $Ent(a - k, a - k) > Ent(a + k, a)$ for $k \ge 1$.
Let $f(x)$ denote $-k \ln\frac{k}{x+k} - x \ln\frac{x}{x+k}$. We then have
$Ent(a - k, a - k) = \frac{f(b)}{a+b}$ and $Ent(a + k, a) = \frac{f(a)}{a+b}$.
Note that $f'(x) = \ln\frac{x+k}{x} > 0$ for $x > 0$. Because
$b > a > 0$, we have $f(b) > f(a)$, and hence $Ent(a - k, a - k) > Ent(a + k, a)$.

From Theorem 1, $Ent(x, y)$ is maximum at any point $(x, y)$ on the line
between $(0, 0)$ and $(a + b, a)$, and $Ent(x, y)$ is a concave function on the gray region
of Figure 8. Let $P$ be an arbitrary point in the gray region.

- If $P$ is in the upper triangle, draw the line that passes through the origin
and $P$, and suppose that the line hits the line connecting $(a - k, a - k)$ and
$(a + k, a)$ at $Q$. Since $Ent(0, 0)$ is maximum, from the concavity of the entropy
function, $Ent(P) \ge Ent(Q)$. Since $Ent(a - k, a - k) > Ent(a + k, a)$, from
the concavity of the entropy, $Ent(Q) \ge Ent(a + k, a)$, and hence $Ent(P) \ge
Ent(a + k, a)$.

- If $P$ is in the lower triangle, suppose that the line going through $(a + k, a)$
and $P$ hits the line between $(0, 0)$ and $(a + b, a)$ at $Q$. Then $Ent(P) \ge Ent(a + k, a)$,
because $Ent(Q)$ is maximum and $Ent(x, y)$ is a concave function.

In both cases, the entropy of any point in the gray region is no less than the
entropy of $(a + k, a)$. Recall that $(a + k, a)$ corresponds to the positive disjunction
associated with $\mathcal{C}_{min}$. Consequently the positive disjunction that minimizes the
entropy corresponds to $\mathcal{C}_{min}$. □
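
The reduction can be checked by brute force on a tiny instance. The Python sketch below builds the extended instance (one unique white element per subset, $b$ white elements in total) and enumerates every sub-collection. The function names are hypothetical, the entropy uses natural logarithms, and the positivity test $y/x \ge m/n$ is an assumption standing in for the definition of a positive test given earlier in the paper.

```python
import math
from itertools import combinations

def ent(p):
    """Binary entropy of a proportion p; ent(0) = ent(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def split_entropy(x, y, n, m):
    """Ent(x, y) for n records of which m are true (black)."""
    inside = (x / n) * ent(y / x) if x > 0 else 0.0
    outside = ((n - x) / n) * ent((m - y) / (n - x)) if x < n else 0.0
    return inside + outside

def best_positive_disjunction(V, C):
    """V: set of black elements; C: list of subsets of V.
    Returns (entropy, chosen_indices) of the best positive disjunction."""
    a, c = len(V), len(C)
    b = max(a, c) + 1                  # any b greater than both a and c
    n, m = a + b, a                    # all records / black (true) records
    best = None
    for size in range(1, c + 1):
        for chosen in combinations(range(c), size):
            black = set().union(*(C[i] for i in chosen))
            x = len(black) + size      # one unique white element per chosen subset
            y = len(black)
            if y * n < m * x:          # assumed positivity test: y/x >= m/n
                continue
            cand = (split_entropy(x, y, n, m), chosen)
            if best is None or cand[0] < best[0] - 1e-12:
                best = cand
    return best
```

On `V = {1, 2, 3}` with `C = [{1, 2}, {2, 3}, {3}, {1}]`, the minimizing sub-collection consists of two subsets that together cover `V`, in line with the theorem.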

Design of Polynomial-Time Approximation Algorithm is Hard. We remark
that it is hard to design a polynomial-time algorithm that approximates
the optimal positive disjunction of tests. In the proof of Theorem 5, we reduce
MINIMUM COVER to the problem of computing the optimal positive disjunction.
Since it is NP-hard to find $\mathcal{C}_{min}$, it is natural to further ask whether or
not there exists an efficient way of approximating $\mathcal{C}_{min}$. To be more precise, we
wish to design a polynomial-time approximation algorithm that finds a cover
of $V$ such that the number $l$ of white points in it is close to the number $k$ of
the white points in $\mathcal{C}_{min}$; that is, the approximation ratio $l/k$ is as small as
possible. According to the result by Lund and Yannakakis [7], unless P=NP the
best possible approximation ratio is $O(\log n)$, where $n$ is the number of points
in a maximum-size subset.

Appendix

Proof of Theorem 1 

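We first prove the concavity of $Ent(x, y)$. For any $x > y > 0$ and any real numbers
$\delta_1$ and $\delta_2$, let $V$ denote $\delta_1 x + \delta_2 y$. It suffices to prove $\partial^2 Ent(x, y)/\partial V^2 \le 0$.
Recall that

$$Ent(x, y) = \frac{x}{n}\, ent\!\left(\frac{y}{x}\right) + \frac{n - x}{n}\, ent\!\left(\frac{m - y}{n - x}\right).$$

Define $f(x, y) = \frac{x}{n}\, ent\!\left(\frac{y}{x}\right)$. Then,

$$Ent(x, y) = f(x, y) + f(n - x, m - y) \qquad (1)$$

To prove $\partial^2 Ent(x, y)/\partial V^2 \le 0$, it is sufficient to show the following inequalities:

$$\partial^2 f(x, y)/\partial V^2 \le 0, \qquad \partial^2 f(n - x, m - y)/\partial V^2 \le 0.$$

Here we will prove the former inequality. The latter can be proved in a similar
way. When $\delta_1, \delta_2 \neq 0$, we have:

$$f(x, y) = \frac{1}{n}\left(-y \log\frac{y}{x} - (x - y) \log\left(1 - \frac{y}{x}\right)\right)$$

$$\frac{\partial f(x, y)}{\partial V} = \frac{\partial f(x, y)}{\partial x}\frac{\partial x}{\partial V} + \frac{\partial f(x, y)}{\partial y}\frac{\partial y}{\partial V}
= \frac{1}{n}\bigl(-\log(x - y) + \log x\bigr)\frac{1}{\delta_1} + \frac{1}{n}\bigl(-\log y + \log(x - y)\bigr)\frac{1}{\delta_2}$$

$$\frac{\partial^2 f(x, y)}{\partial V^2}
= \frac{1}{n}\left(-\frac{1}{x - y} + \frac{1}{x}\right)\frac{1}{\delta_1^2}
+ \frac{1}{n}\,\frac{1}{x - y}\,\frac{1}{\delta_2 \delta_1}
+ \frac{1}{n}\,\frac{1}{x - y}\,\frac{1}{\delta_1 \delta_2}
+ \frac{1}{n}\left(-\frac{1}{y} - \frac{1}{x - y}\right)\frac{1}{\delta_2^2}$$

$$= -\frac{1}{n \delta_1^2 \delta_2^2\, x (x - y) y}\,(x \delta_1 - y \delta_2)^2 \le 0 \qquad (x > y > 0)$$

When $\delta_1 = 0$, $\delta_2 \neq 0$, $V = \delta_2 y$, and we have:

$$\frac{\partial f(x, y)}{\partial V} = \frac{1}{n}\bigl(-\log y + \log(x - y)\bigr)\frac{1}{\delta_2}$$

$$\frac{\partial^2 f(x, y)}{\partial V^2} = \frac{1}{n}\left(-\frac{1}{y} - \frac{1}{x - y}\right)\frac{1}{\delta_2^2}
= -\frac{1}{n}\,\frac{x}{(x - y) y}\,\frac{1}{\delta_2^2} \le 0 \qquad (x > y > 0)$$

The case when $\delta_1 \neq 0$ and $\delta_2 = 0$ can be handled in a similar way.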

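Next we prove that $Ent(x, y)$ is maximum when $y/x = m/n$. From Equation (1),
observe that for any $0 \le y \le x$, $Ent(x, y) = Ent(n - x, m - y)$.
According to the concavity of $Ent(x, y)$, we have

$$Ent(x, y) = \frac{1}{2} Ent(x, y) + \frac{1}{2} Ent(n - x, m - y)
\le Ent\!\left(\frac{x + n - x}{2}, \frac{y + m - y}{2}\right)
= Ent\!\left(\frac{n}{2}, \frac{m}{2}\right),$$

which means that $Ent(x, y)$ is maximum when $(x, y) = (\frac{n}{2}, \frac{m}{2})$. Finally we can
prove that $Ent(x, y)$ is constant on $y = (m/n)x$, because

$$Ent(x, y) = \frac{x}{n}\left(-\frac{m}{n}\log\frac{m}{n} - \left(1 - \frac{m}{n}\right)\log\left(1 - \frac{m}{n}\right)\right)
+ \frac{n - x}{n}\left(-\frac{m}{n}\log\frac{m}{n} - \left(1 - \frac{m}{n}\right)\log\left(1 - \frac{m}{n}\right)\right)$$

$$= -\frac{m}{n}\log\frac{m}{n} - \left(1 - \frac{m}{n}\right)\log\left(1 - \frac{m}{n}\right).$$

Since $(\frac{n}{2}, \frac{m}{2})$ is on $y = (m/n)x$, $Ent(x, y)$ is maximum when $y/x = m/n$. □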



Proof of Theorem 2 

The interclass variance can be transformed as follows:

$$|R_1|(\mu(R_1) - \mu(R))^2 + |R_2|(\mu(R_2) - \mu(R))^2
= -|R|\mu(R)^2 + \left(|R_1|\mu(R_1)^2 + |R_2|\mu(R_2)^2\right),$$

because $|R| = |R_1| + |R_2|$ and $|R_1|\mu(R_1) + |R_2|\mu(R_2) = |R|\mu(R)$. Since $|R|$
and $\mu(R)$ are constants, the maximization of the interclass variance is equivalent
to the maximization of $|R_1|\mu(R_1)^2 + |R_2|\mu(R_2)^2$.

On the other hand, the intraclass variance can be transformed as follows:

$$\frac{\sum_{t \in R_1}(t[A] - \mu(R_1))^2 + \sum_{t \in R_2}(t[A] - \mu(R_2))^2}{|R|}
= \frac{\sum_{t \in R} t[A]^2 - \left(|R_1|\mu(R_1)^2 + |R_2|\mu(R_2)^2\right)}{|R|},$$

because $\sum_{t \in R_1} t[A] = |R_1|\mu(R_1)$ and $\sum_{t \in R_2} t[A] = |R_2|\mu(R_2)$. Since $R$ is fixed,
$\sum_{t \in R} t[A]^2$ is a constant. Thus the minimization of the intraclass variance is
equivalent to the maximization of $|R_1|\mu(R_1)^2 + |R_2|\mu(R_2)^2$, which is also equivalent
to the maximization of the interclass variance. □
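
As a quick numerical sanity check of this equivalence, the following Python sketch evaluates both quantities for every split point of a small, made-up list of objective values. If the two transformations above hold, the interclass variance plus $|R|$ times the intraclass variance should be the same constant for every split. The data and function names are illustrative assumptions only.

```python
def mean(values):
    return sum(values) / len(values)

def interclass(r1, r2):
    """|R1|(mu(R1) - mu(R))^2 + |R2|(mu(R2) - mu(R))^2."""
    mu = mean(r1 + r2)
    return len(r1) * (mean(r1) - mu) ** 2 + len(r2) * (mean(r2) - mu) ** 2

def intraclass(r1, r2):
    """Sum of squared deviations from each part's own mean, divided by |R|."""
    n = len(r1) + len(r2)
    return (sum((a - mean(r1)) ** 2 for a in r1)
            + sum((a - mean(r2)) ** 2 for a in r2)) / n

values = [1.0, 2.0, 4.0, 7.0, 8.0, 9.0]   # made-up objective attribute values
totals = []
for cut in range(1, len(values)):
    r1, r2 = values[:cut], values[cut:]
    totals.append(interclass(r1, r2) + len(values) * intraclass(r1, r2))
# Every entry of `totals` agrees up to rounding error, so the split that
# maximizes the interclass variance also minimizes the intraclass variance.
```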



Proof of Theorem 3 



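The proof is similar to the proof of Theorem 1. We first prove that $Var(x, y)$ is
a convex function for $0 < x < n$. For any $0 < x < n$ and any $\delta_1$ and $\delta_2$, let $V$
denote $\delta_1 x + \delta_2 y$. We will prove the case when $\delta_1, \delta_2 \neq 0$. It is sufficient to prove that
$\partial^2 Var(x, y)/\partial V^2 \ge 0$. Recall that

$$Var(x, y) = x\left(\frac{y}{x} - \frac{m}{n}\right)^2 + (n - x)\left(\frac{m - y}{n - x} - \frac{m}{n}\right)^2.$$

Define $g(x, y) = x\left(\frac{y}{x} - \frac{m}{n}\right)^2$. Then,

$$Var(x, y) = g(x, y) + g(n - x, m - y). \qquad (2)$$

We prove $\partial^2 Var(x, y)/\partial V^2 \ge 0$ by showing the following two inequalities:

$$\partial^2 g(x, y)/\partial V^2 \ge 0, \qquad \partial^2 g(n - x, m - y)/\partial V^2 \ge 0.$$

We prove the former case. The latter can be shown in a similar manner.

$$\frac{\partial g(x, y)}{\partial V} = \frac{\partial g(x, y)}{\partial x}\frac{\partial x}{\partial V} + \frac{\partial g(x, y)}{\partial y}\frac{\partial y}{\partial V}
= \frac{1}{\delta_1}\left\{\left(\frac{m}{n}\right)^2 - \left(\frac{y}{x}\right)^2\right\} + \frac{2}{\delta_2}\left(\frac{y}{x} - \frac{m}{n}\right)$$

$$\frac{\partial^2 g(x, y)}{\partial V^2} = \frac{2}{x}\left(\frac{y}{\delta_1 x} - \frac{1}{\delta_2}\right)^2 \ge 0$$

Next we prove that $Var(x, y)$ is minimum when $y/x = m/n$. From Equation (2),
we have $Var(x, y) = Var(n - x, m - y)$. From the convexity of $Var(x, y)$,

$$Var(x, y) = \frac{1}{2} Var(x, y) + \frac{1}{2} Var(n - x, m - y)
\ge Var\!\left(\frac{x + n - x}{2}, \frac{y + m - y}{2}\right)
= Var\!\left(\frac{n}{2}, \frac{m}{2}\right),$$

which implies that $Var(x, y)$ is minimum when $(x, y) = (\frac{n}{2}, \frac{m}{2})$. It is easy to see
that $Var(x, y) = 0$ when $y/x = m/n$. Since $(\frac{n}{2}, \frac{m}{2})$ is on $y/x = m/n$, $Var(x, y)$
is minimum when $y/x = m/n$. □

1 Extended Abstract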

The pharmaceutical industry is increasingly overwhelmed by large-volume data.
This is generated both internally, as a side-effect of screening tests and
combinatorial chemistry, and externally from sources such as the human genome
project. The industry is predominantly knowledge-driven. For instance, knowl- 
edge is required within computational chemistry for pharmacophore identifica- 
tion, as well as for determining biological function using sequence analysis. 

From a computer science point of view, the knowledge requirements within 
the industry give higher emphasis to “knowing that” (declarative or descriptive 
knowledge) rather than “knowing how” (procedural or prescriptive knowledge). 
Mathematical logic has always been the preferred representation for declarative 
knowledge and thus knowledge discovery techniques are required which generate 
logical formulae from data. Inductive Logic Programming (ILP) [6,1] provides 
such an approach. 

This talk will review the results of the last few years’ academic pilot studies
involving the application of ILP to the prediction of protein secondary
structure [5,8,9], mutagenicity [4,7], structure activity [3], pharmacophore
discovery [2] and protein fold analysis [10]. While predictive accuracy is the central
performance measure of data analytical techniques which generate procedural 
knowledge (neural nets, decision trees, etc.), the performance of an ILP system 
is determined both by accuracy and degree of stereo-chemical insight provided. 
ILP hypotheses can be easily stated in English and exemplified diagrammatically. 
This allows cross-checking with the relevant biological and chemical literature. 
Most importantly it allows for expert involvement in human background knowl- 
edge refinement and for final dissemination of discoveries to the wider scientific 
community. In several of the comparative trials presented ILP systems provided 
significant chemical and biological insights where other data analysis techniques 
did not. 

In his statement of the importance of this line of research to the Royal
Society [8], Sternberg emphasised the aspect of joint human-computer collaboration
in scientific discoveries. Science is an activity of human societies. It is our belief
that computer-based scientific discovery must support strong integration into
the existing social environment of human scientific communities. The discovered
knowledge must add to and build on existing science. The author believes that
the ability to incorporate background knowledge and re-use learned knowledge,
together with the comprehensibility of the hypotheses, have marked out ILP as
a particularly effective approach for scientific knowledge discovery.

Abstract. In machine learning, it is important to reduce the computational
time needed to analyze learning algorithms. Some researchers have attempted
to understand learning algorithms by experimenting with them on a variety
of domains. Others have presented theoretical analyses of learning
algorithms using approximate mathematical models. A mathematical
model has the deficiency that, if it is too simplified, it may
lose the essential behavior of the original algorithm. Furthermore,
experimental analyses are based only on informal analyses of the learning
task, whereas theoretical analyses address the worst case. Therefore, the
results of theoretical analyses are quite different from empirical results.
In our framework, called random case analysis, we adopt the idea of
randomized algorithms. Random case analysis can predict
various aspects of a learning algorithm’s behavior, and requires less
computational time than the other theoretical analyses. Furthermore, we can
easily apply our framework to practical learning algorithms.



1 Introduction 

The main objective of this paper is to understand the learning behavior of induc- 
tive learning algorithms under various conditions. To this end, many researchers 
have studied systematic experimentation on a variety of problems in order to find 
empirical regularities. For example, UCI machine learning database [9] is widely 
used for experimental analysis. Since some domains may contain noise and other 
domains may include many irrelevant attributes, the algorithm’s learning rate is 
affected by these invisible factors. Thus, experimental analysis leads to findings 
on the average case accuracy of an algorithm. 

Others have carried out theoretical analyses based on the paradigm of
probably approximately correct (PAC) learning [4]. Although the PAC model has led to many
important insights about the capabilities of machine learning algorithms, it
can only deal with overly simplified versions of actual learning algorithms.
Furthermore, the PAC model produces overly conservative estimates of error and
does not take advantage of the information available in the actual training set [3].

Average-case analysis [10] was proposed to unify the theoretical and
empirical approaches to analyzing the behavior of learning algorithms under various
conditions, along with information about the domain. Average-case analysis can
explain empirical observations, predict average-case learning curves, and guide
the development of new learning algorithms. The main drawback of average-case
analysis is that it requires a detailed analysis of an individual algorithm to
determine the conditions under which the algorithm changes the hypothesis. Therefore
researchers have studied only simple learning algorithms (i.e., wholist [10],
one-level decision trees [5], Bayesian classifiers [7] and the nearest neighbor
algorithm [8]).

Pazzani [10] anticipated that the average-case analysis framework would scale
to similar algorithms using more complex hypotheses, and that it would be possible, but
computationally expensive, to model more complex learning algorithms with
more expressive concepts. However, due to this complexity, only a limited
number of training examples are considered by the average-case analysis.

In our framework, called random case analysis, we adopt the idea of
randomized algorithms [6] [13]. By randomized algorithms we mean algorithms which
make random choices in the course of their execution. As a result, even for a
fixed input, different runs of a randomized algorithm may give different results.
By using random case analysis, we can predict various aspects of a learning
algorithm’s behavior, requiring much less computational time than previous analyses.
Furthermore, we can easily apply our framework to practical learning algorithms,
such as ID3 [11], C4.5 [12], CBL [1] and AQ [15].

2 Random Case Analysis 

2.1 Basic Idea 

The idea of randomized algorithms is applied in the fields of number
theory, computational geometry, pattern matching and data structure maintenance.
Many of the basic ideas of randomization were discovered and applied quite early
in the field of computer science. For example, random sampling, random walk
and randomized rounding can be used effectively in algorithm design.

We now consider the use of random sampling for the problem of random
case analysis. Random sampling is based on the idea of drawing random
samples from $L$ (the population) in order to determine the characteristics of $L$, thereby
reducing the size of the problem. That is, random case analysis takes advantage
of random sampling so as to evaluate a specific learning algorithm under various
conditions even with a large number of training sets, thereby reducing the
computational cost for evaluation of learning algorithms.

Fig. 1 shows the simple framework of random case analysis. Note that $N$ is the
total number of trials, $L$ is the original training set, and $K$ is a performance measure
such as performance accuracy. In Fig. 1, independent trials of random case
analysis are Bernoulli trials in the sense that successive trials are independent and
at each trial the probability of a ‘successful’ classification remains
constant. Then the distribution of successes is given by the binomial distribution.

Now consider the fact that the number of possible concept descriptions grows
exponentially with the number of training examples and the number of
attributes. Thus, it would be interesting to obtain a reasonable number of trials.

1. $n \leftarrow 0$, $X \leftarrow 0$.

2. Draw from $L$ a random sample $L'$ of size $l$.

3. Apply the learning algorithm to the training examples $L'$, and if
the hypothesis can classify the test example correctly, $X \leftarrow X + 1$.

4. $n \leftarrow n + 1$.

5. If $n \ge N$ then goto (6), if $n < N$ then goto (2).

6. $K \leftarrow X/N$.

Fig. 1. A simple framework of random case analysis.
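
The six steps of Fig. 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the figure does not say where the test example comes from, so the sketch draws it at random from `population`; the `learn` and `classify` callables and the toy majority-class learner are hypothetical stand-ins for the learning algorithm under evaluation.

```python
import random

def random_case_analysis(population, learn, classify, l, N, rng=None):
    """Estimate the accuracy K = X / N from N independent trials.
    population: list of (attributes, label) cases standing in for L.
    learn(sample) -> hypothesis; classify(hypothesis, attributes) -> label."""
    rng = rng or random.Random()
    X = 0                                         # step 1
    for _ in range(N):                            # steps 4 and 5: N trials
        sample = rng.sample(population, l)        # step 2: sample of size l
        hypothesis = learn(sample)                # step 3: run the learner
        attrs, label = rng.choice(population)     # test example (assumption)
        if classify(hypothesis, attrs) == label:
            X += 1
    return X / N                                  # step 6

def learn_majority(sample):
    """Toy learner: predict the most frequent label in the sample."""
    labels = [label for _, label in sample]
    return max(set(labels), key=labels.count)

def classify_constant(hypothesis, attrs):
    """Toy classifier: the hypothesis is a single constant label."""
    return hypothesis
```

A call such as `random_case_analysis(cases, learn_majority, classify_constant, l=5, N=50)` returns an estimate of $K$ between 0 and 1.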



We are faced with a question in designing a random case analysis: how large
must $N$ be so that the approximate performance accuracy achieves a high
level of confidence? A tight answer to this question comes from a technique known
as the Chernoff bound [13].

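Theorem 1 Let $X_i$, $1 \le i \le n$, be independent Bernoulli trials such that
$\Pr(X_i = 1) = p_i$ and $\Pr(X_i = 0) = 1 - p_i$. Let $X = \sum_{i=1}^{n} X_i$ and
$\mu = \sum_{i=1}^{n} p_i > 0$. Then for a real number $\delta \in (0, 1]$,

$$\Pr(X < (1 - \delta)\mu) < \exp(-\mu\delta^2/2) \stackrel{\mathrm{def}}{=} F^-(\mu, \delta) \qquad (1)$$

$$\Pr(X > (1 + \delta)\mu) < \left[\frac{\exp(\delta)}{(1 + \delta)^{(1 + \delta)}}\right]^{\mu} \stackrel{\mathrm{def}}{=} F^+(\mu, \delta) \qquad (2)$$

(1) indicates the probability that $X$ is below $(1 - \delta)\mu$, whereas (2) shows the
probability that $X$ exceeds $(1 + \delta)\mu$. These inequalities give bounds on the
“tail probabilities” of the binomial distribution.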

Now consider random case analysis by using the Chernoff bound. Let $X_i$
be 1 if the hypothesis correctly classifies the test example at the $i$-th trial and 0
otherwise. Let $X = \sum_{i=1}^{N} X_i$ be the number of correct classifications out of $N$
trials. The performance accuracy is $K = X/N$ and its expectation is $\mu_K = \mu/N$.
Then we can get the following theorem.

Theorem 2 Let $X_1, \ldots, X_N$ be independent Bernoulli trials with $\Pr(X_i = 1) = p_i$,
$p_i \in (0, 1)$. Let $K = \frac{1}{N}\sum_{i=1}^{N} X_i$ and $\mu_K = \frac{1}{N}\sum_{i=1}^{N} p_i > 0$. Then for $\delta \in (0, 1]$,

$$\Pr(K < (1 - \delta)\mu_K) < \exp(-\mu_K \delta^2 N/2) \stackrel{\mathrm{def}}{=} F^-(\mu_K, \delta) \qquad (3)$$

$$\Pr(K > (1 + \delta)\mu_K) < \left[\frac{\exp(\delta)}{(1 + \delta)^{(1 + \delta)}}\right]^{\mu_K N} \stackrel{\mathrm{def}}{=} F^+(\mu_K, \delta) \qquad (4)$$



By using inequalities (3) and (4), we will consider the minimal number of
trials for a given $\delta$.

2.2 The Minimal Number of Trials

We assume that $\mu_K$ and $\delta$ are constant. As $N$ increases, we can
derive the following inequalities from (3) and (4).

$$\Pr(K < (1 - \delta)\mu_K) < \exp(-\mu_K \delta^2 N/2) \xrightarrow{N \to \infty} +0 \qquad (5)$$

$$\Pr(K > (1 + \delta)\mu_K) < \left[\frac{\exp(\delta)}{(1 + \delta)^{(1 + \delta)}}\right]^{\mu_K N} \xrightarrow{N \to \infty} +0 \qquad (6)$$

From (5), we can say that $K \ge (1 - \delta)\mu_K$ for any $\delta$ as $N$ increases. Furthermore,
as $\delta$ gets smaller, we can obtain the inequality $K \ge \mu_K$. In the same way,
from (6), as $N$ increases, $K \le \mu_K$. From this observation, we can say that
$K = \mu_K$ if $N$ is a large number.

We now return to the simplification of (3) and (4) to obtain the
minimal number of trials $N_{min}$. First of all, we will define the function $f^-(F^-, \delta)$
such that

$$f^-(F^-, \delta) \stackrel{\mathrm{def}}{=} \frac{-2 \ln F^-(\mu_K, \delta)}{\delta^2} \qquad (7)$$

$f^-(F^-, \delta)$ determines the minimal number of trials where $K \ge (1 - \delta)\mu_K$. By
using (3) and (7), we can obtain the minimal value

$$N^-_{min} \stackrel{\mathrm{def}}{=} \frac{-2 \ln F^-(\mu_K, \delta)}{\mu_K \delta^2} = \frac{f^-(F^-, \delta)}{\mu_K} \qquad (8)$$

In the same way, we will define the function $f^+(F^+, \delta)$

$$f^+(F^+, \delta) \stackrel{\mathrm{def}}{=} \log_{\frac{\exp(\delta)}{(1 + \delta)^{(1 + \delta)}}} F^+(\mu_K, \delta) \qquad (9)$$

$f^+(F^+, \delta)$ determines the minimal number of trials where $K \le (1 + \delta)\mu_K$. By
using (4) and (9), we can obtain the minimal value

$$N^+_{min} \stackrel{\mathrm{def}}{=} \frac{\log_{\frac{\exp(\delta)}{(1 + \delta)^{(1 + \delta)}}} F^+(\mu_K, \delta)}{\mu_K} = \frac{f^+(F^+, \delta)}{\mu_K} \qquad (10)$$

By using (8) and (10), we will define the function

$$f(F^+, F^-, \delta) \stackrel{\mathrm{def}}{=} \max\{f^+(F^+, \delta), f^-(F^-, \delta)\} \qquad (11)$$

This function determines the minimal number of trials where $(1 - \delta)\mu_K \le K \le
(1 + \delta)\mu_K$. The minimal value $N_{min}$ which satisfies both (3) and (4) is as follows:

$$N_{min} = \max\{N^+_{min}, N^-_{min}\} = \frac{\max\{f^+(F^+, \delta), f^-(F^-, \delta)\}}{\mu_K} = \frac{f(F^+, F^-, \delta)}{\mu_K} \qquad (12)$$

(12) means that if $\mu_K$ gets smaller the minimal number of trials
increases in inverse proportion. In order to avoid this problem, we will re-visit Theorem 2.
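
The quantities of this subsection are easy to evaluate numerically. The Python sketch below assumes the readings $f^- = -2\ln F^-/\delta^2$ and $f^+ = \log_{e^{\delta}/(1+\delta)^{1+\delta}} F^+$, with $N_{min} = f/\mu_K$; the function names and the example values are illustrative assumptions.

```python
import math

def f_minus(F_minus, delta):
    """f^-(F^-, delta) = -2 ln F^- / delta^2."""
    return -2.0 * math.log(F_minus) / delta ** 2

def f_plus(F_plus, delta):
    """f^+(F^+, delta): logarithm of F^+ to the base e^delta / (1+delta)^(1+delta)."""
    # Natural log of the base; it is negative for delta > 0.
    log_base = delta - (1.0 + delta) * math.log(1.0 + delta)
    return math.log(F_plus) / log_base

def f_both(F_plus, F_minus, delta):
    """f(F^+, F^-, delta) = max{f^+, f^-}."""
    return max(f_plus(F_plus, delta), f_minus(F_minus, delta))

def n_min(F_plus, F_minus, delta, mu_K):
    """Minimal number of trials N_min = f(F^+, F^-, delta) / mu_K."""
    return f_both(F_plus, F_minus, delta) / mu_K
```

For instance, with both tail bounds set to 0.05 and $\delta = 0.1$, `f_both(0.05, 0.05, 0.1)` is on the order of six hundred, and halving `mu_K` doubles `n_min`, which shows the dependence on $\mu_K$ discussed above.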

Now we will define the misclassification rate $K_{miss} = \bar{X}/N$ where
$\bar{X} = \sum_{i=1}^{N} (1 - X_i)$ is the number of misclassifications. Let $\mu_{K_{miss}}$ be the expectation of the misclassification rate. Suppose that
the misclassification rate $K_{miss}$ achieves a high level of confidence; then we
obtain $K = 1 - K_{miss}$. Given this, the probability that the misclassification
rate $K_{miss}$ is below $(1 - \delta)\mu_{K_{miss}}$ is as follows:

$$\Pr(K_{miss} < (1 - \delta)\mu_{K_{miss}}) < \exp(-(1 - \mu_K)\delta^2 N/2) \qquad (13)$$

Furthermore, the probability that the misclassification rate exceeds $(1 + \delta)\mu_{K_{miss}}$
is derived from (4).

$$\Pr(K_{miss} > (1 + \delta)\mu_{K_{miss}}) < \left[\frac{\exp(\delta)}{(1 + \delta)^{(1 + \delta)}}\right]^{(1 - \mu_K)N} \qquad (14)$$

By using (7) and (9), we can transform (13) and (14) into (15) and (16)
respectively.

$$N^-_{miss\,min} \stackrel{\mathrm{def}}{=} \frac{-2 \ln F^-(\mu_K, \delta)}{(1 - \mu_K)\delta^2} = \frac{f^-(F^-, \delta)}{1 - \mu_K} \qquad (15)$$

$$N^+_{miss\,min} \stackrel{\mathrm{def}}{=} \frac{\log_{\frac{\exp(\delta)}{(1 + \delta)^{(1 + \delta)}}} F^+(\mu_K, \delta)}{1 - \mu_K} = \frac{f^+(F^+, \delta)}{1 - \mu_K} \qquad (16)$$

Using (11), we arrive at

$$N_{miss\,min} = \max\{N^+_{miss\,min}, N^-_{miss\,min}\} = \frac{\max\{f^+(F^+, \delta), f^-(F^-, \delta)\}}{1 - \mu_K} = \frac{f(F^+, F^-, \delta)}{1 - \mu_K} \qquad (17)$$

From (17), we can obtain the minimal number of trials $N_{miss\,min}$ where the
misclassification rate has a high level of confidence.

(12) and (17) specify that if we want to achieve a high level of confidence in
random case analysis, we should repeat trials of random case analysis (Fig. 1)
until either of the following conditions is satisfied.

$$N \ge N_{min} = \frac{f(F^+, F^-, \delta)}{\mu_K} \qquad (18)$$

$$N \ge N_{miss\,min} = \frac{f(F^+, F^-, \delta)}{1 - \mu_K} \qquad (19)$$

Given $\mu_K = X/N$, (18) and (19) are transformed into (20) and (21) respectively.

$$X \ge f(F^+, F^-, \delta) \qquad (20)$$

$$N - X \ge f(F^+, F^-, \delta) \qquad (21)$$



(20) and (21) are the conditions under which the random case analysis algorithm can
be terminated. In other words, the number of trials in random case analysis is
less than $2 \times f(F^+, F^-, \delta)$ and at least $f(F^+, F^-, \delta)$; thereby random case
analysis can reduce the computational cost for evaluation of learning algorithms.

2.3 Improvement of Random Case Analysis 

Fig. 2 shows an improved version of random case analysis whose algorithm is
extended by adding conditions (20) and (21). EXAMPLE($\alpha$) takes $\alpha$ as input
and generates a case $(T_c, T_a) = (l_c, (a_1, \ldots, a_t))$ where $l_c$ is the category of a case
$(a_1, \ldots, a_t)$. $\alpha$ can be described as $(f, (p_1, p_2, \ldots, p_t))$, where $f$ determines the
category of a case and $p_i$ is the set of probabilities of the attribute values $a_{ik}$
occurring in a case. Note that $k$ is the number of attribute values of $a_i$. LEARN($L$)
learns a conceptual description $\beta$ from a training set $L$. LEARN($L$) is the
inductive learning algorithm to be evaluated. CLASSIFY($\beta$, $T_a$) determines the
class of the case $T_a$ according to the conceptual description $\beta$.

3 Implications for Learning Behavior 

3.1 Comparison with PAC Learning and Average Case Analysis 

First of all, we compare the learning curves of the wholist algorithm obtained
in the frameworks of random case analysis, PAC learning and average case
analysis. The wholist algorithm [2] is a predecessor of the one-sided algorithm for
pure conjunctive concepts. Although wholist is a relatively simple algorithm for
random case analysis, its theoretical prediction has already been studied by
using the PAC model and the average case model.

The PAC model of wholist indicates that the minimal number of training examples
for learning a monotone monomial concept is as follows:

$$m > \frac{1}{\epsilon}\left(n + \log_2 \frac{1}{\delta}\right) \qquad (22)$$

where $\epsilon$ is an accuracy parameter, $\delta$ is a confidence parameter, and $n$ is the
number of variables in a monotone monomial concept. Given this, we have

$$1 - \epsilon < 1 - \frac{1}{m}\left(n + \log_2 \frac{1}{\delta}\right) \qquad (23)$$

The left hand side shows the worst performance accuracy and the right hand
side indicates its upper bound.

RandomCaseAnalysis(α, l, F+, F−, δ)

  α  : concept description and the probabilities of attribute values
  l  : number of training examples
  F+ : upper bound
  F− : lower bound
  δ  : confidence parameter

begin
  X ← 0 ;
  N ← 0 ;
  while (X < f(F+, F−, δ) and N − X < f(F+, F−, δ)) do
  begin
    L ← ∅ ;
    N ← N + 1 ;
    repeat l do
      L ← L ∪ {EXAMPLE(α)} ;
    β ← LEARN(L) ;
    (Tc, Ta) ← EXAMPLE(α) ;
    if Tc = CLASSIFY(β, Ta) then
      X ← X + 1 ;
  end ;
  output X/N ;
end.

Fig. 2. Algorithm of random case analysis.

^ An important practical consideration that is ignored in Fig. 2 is the existence of noise
in the source of examples. The noise can take on various forms and is modeled in two
categories (i.e., class noise and attribute noise) in our research. We have extended
EXAMPLE(α) so as to generate a case which contains noise, but we do not
describe the detailed algorithm for the sake of simplicity.
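
The algorithm of Fig. 2 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: the closed form of `f_bound` is an assumption (the standard solutions of the two Chernoff tails), and `make_example`, `wholist_learn` and `wholist_classify` are hypothetical stand-ins for EXAMPLE, LEARN and CLASSIFY in a noise-free domain with three Boolean attributes and target concept $x_3$.

```python
import math
import random


def f_bound(F_plus, F_minus, delta):
    """Assumed closed form of f(F+, F-, delta) = max{f+, f-}."""
    f_minus = -2.0 * math.log(F_minus) / delta ** 2
    f_plus = math.log(F_plus) / (delta - (1.0 + delta) * math.log1p(delta))
    return max(f_plus, f_minus)


def random_case_analysis(example, learn, classify, l, F_plus, F_minus, delta):
    """Repeat independent learn-and-test trials until (20) or (21) holds."""
    bound = f_bound(F_plus, F_minus, delta)
    X = 0  # number of correctly classified test cases
    N = 0  # number of trials
    while X < bound and N - X < bound:
        N += 1
        L = [example() for _ in range(l)]   # fresh training set of size l
        beta = learn(L)                     # conceptual description
        t_c, t_a = example()                # one fresh test case
        if t_c == classify(beta, t_a):
            X += 1
    return X / N, N


def make_example(rng, n=3, relevant=(2,)):
    """EXAMPLE stand-in: uniform Boolean attributes, monotone monomial target."""
    def example():
        attrs = tuple(rng.random() < 0.5 for _ in range(n))
        return all(attrs[i] for i in relevant), attrs
    return example


def wholist_learn(L):
    """Wholist stand-in: keep the variables that are true in every positive case."""
    n = len(L[0][1])
    h = set(range(n))
    for label, attrs in L:
        if label:
            h &= {i for i in range(n) if attrs[i]}
    return h


def wholist_classify(h, attrs):
    """Predict positive iff every retained variable is true."""
    return all(attrs[i] for i in h)


if __name__ == "__main__":
    rng = random.Random(0)
    acc, trials = random_case_analysis(
        make_example(rng), wholist_learn, wholist_classify,
        l=10, F_plus=0.05, F_minus=0.05, delta=0.1)
    print(f"estimated accuracy {acc:.3f} after {trials} trials")
```

Each trial draws a new training set and a single test case, so the cost per trial depends only on $l$ and on the learner, not on the number of possible training sets.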

Fig. 3 indicates that the predicted accuracy of random case analysis is very
similar to that of average case analysis. On the other hand, the predicted
accuracy of PAC learning is over-conservative and its learning curve is quite different
from the others. Note that the experimental conditions are as follows: $\delta = 0.1$,
$n = 3$ for PAC learning; $mono_{[001]}(x_1, x_2, x_3) = x_3$ for average case analysis; and
$mono_{[001]}(x_1, x_2, x_3) = x_3$, $F^+ = 0.05$, $F^- = 0.05$, $\delta = 0.005$ for random case
analysis. Since the number of possible concept descriptions grows exponentially
with the number of training examples and the number of attributes, average case
analysis is limited to a small number of training examples (i.e., 10).

Fig. 4 shows the computational time required for random case analysis and
average case analysis^. This figure shows that random case analysis can reduce
the computational cost for evaluation of learning algorithms even with a large
number of training examples, whereas the computational time of average case
analysis grows exponentially with the number of training examples.

^ We used an SGI Indy workstation with an R4600 133 MHz CPU and 64MB memory under
IRIX 5.3 OS. The program code is written in GNU Common Lisp 2.2.

Fig. 3. Comparison of random case analysis with PAC learning and average case analysis.

Fig. 4. Computational time required for analyzing the wholist algorithm.



3.2 Behavioral Implications of ID3 and C4.5 

Since modeling the induction of full decision trees is too difficult for both PAC learning
and average case analysis, researchers [5] have focused on a simple algorithm that
constructs one-level decision trees. Recently, we have studied X-level decision
trees in the framework of average case analysis [14].
In this experiment, we show the performance accuracies of ID3 and C4.5 by
using random case analysis. We hold the number of classes constant at 3 and the
number of attributes constant at 3. Each attribute has 3 values, and we hold their
probabilities of occurrence constant at 1/3. Furthermore, we hold $F^+$ constant
at 0.05, $F^-$ constant at 0.05 and $\delta$ constant at 0.01. Under this condition, we
generated a target concept from 24 examples.

Fig. 5 shows the learning curves of ID3 and C4.5 as a function of the number
of training examples for two different numbers of irrelevant attributes (i.e., no
irrelevant attribute and 2 irrelevant attributes) when other domain parameters
are held constant. This figure shows that ID3 and C4.5 are unaffected
by the number of irrelevant attributes.

Fig. 6 shows the predicted effects of training examples and two levels of 
class noise (i.e., 0% and 10%) on performance accuracy and learning rate. The 
important point to observe is that, unlike the number of irrelevant attributes, 
the class noise mainly affects the overall rate of improvement. Another point is 
that C4.5 is somewhat robust with respect to class noise, whereas ID3 is not. 

Fig. 7 presents similar results on the interaction between attribute noise and
the number of training examples. This figure also indicates that attribute noise
affects the overall rate of improvement. That is, increasing the noise level flattens
the learning curves of ID3 and C4.5. Here we should point out that the
performance accuracy of C4.5 is very poor if the number of training examples is
small (i.e., less than 11). This means that C4.5
generates a tree with no leaf (i.e., a root) if the number of training examples
is not enough for learning. This kind of learning behavior is hard to investigate
without random case analysis.

Fig. 5. Learning curves as a function of the number of training examples for two
different numbers of irrelevant attributes (ID3 and C4.5).




Fig. 6. Learning curves as a function of the number of training examples for two
levels of class noise (ID3 and C4.5).

Fig. 7. Learning curves as a function of the number of training examples for two
levels of attribute noise (ID3 and C4.5).



3.3 Behavioral Implications of CBL 

In recent years there has been growing interest in methods that store cases or
instances in memory, and that apply these cases directly to new situations. This
method is called case-based learning [1]. The simplest and most widely studied
class of techniques is the nearest neighbor algorithm. Aha reported experimental
results for this algorithm (which he called CBL1). Langley [8] also studied it
theoretically by average case analysis. Although the basic nearest neighbor algorithm is simple
enough to construct the average case model, it is difficult to construct such a model
for a slightly modified version of the nearest neighbor algorithm (called CBL4).
CBL4 is an extension of CBL1 in the sense that it can 1) reduce storage
requirements, 2) tolerate noisy cases, and 3) learn attribute importance, represented as
attribute weight settings.

Fig. 8 shows the interaction between two levels of irrelevant attributes and
the number of training examples. This figure shows that CBL4 is unaffected by
the number of irrelevant attributes whereas CBL1 is more sensitive to irrelevant
attributes. It also reveals that CBL4 tends to have slower learning rates
because it must search for good attribute weight settings. These observations
are consistent with Aha's report [1] on the sensitivity of CBL4 to the number of
irrelevant attributes.

Fig. 9 shows the predicted effects of training examples and two levels of
attribute noise. In [1], Aha applied CBL1 and CBL4 to only two noisy domains
(i.e., LED-24 and Waveform-40) in the UCI machine learning database and
concluded that CBL4 outperforms CBL1. However, our random case analysis reveals
that Aha's experimental results do not prove that CBL4 will always
outperform CBL1 in more general domains.





Fig. 8. Learning curves as a function of the number of training examples for two
different numbers of irrelevant attributes (CBL1 and CBL4).

Fig. 9. Learning curves as a function of the number of training examples for two
levels of attribute noise (CBL1 and CBL4).



Finally, we show the comparative results of AQ (a simplified version of
AQ11), C4.5 and CBL4. Fig. 10 and Fig. 11 indicate that C4.5 outperforms
both AQ and CBL4. Fig. 10 shows that AQ is affected by the number of
irrelevant attributes, whereas C4.5 and CBL4 are unaffected by it.
Fig. 11 shows that AQ and C4.5 are also dramatically
affected by attribute noise, whereas CBL4 is robust compared with the other
algorithms.

The second result to note is the peculiar 'S' shape of CBL4's learning curves,
whereas the other learning curves begin to improve and then level off.
Furthermore, increasing the noise level flattens or stretches the S shape, whereas
increasing the number of irrelevant attributes shifts the learning curves
somewhat to the right. Finally, it is quite surprising that the performance accuracy of
AQ decreases as the number of training examples increases. We cannot
explain this phenomenon at present. Nevertheless, the other theoretical
approaches cannot reveal this phenomenon, and such learning behavior can
only be found by random case analysis or by experimental analysis with artificial
data.




Fig. 10. Learning curves as a function of the number of training examples for two
different numbers of irrelevant attributes (AQ, C4.5 and CBL4).

Fig. 11. Learning curves as a function of the number of training examples for two
levels of attribute noise (AQ, C4.5 and CBL4).



4 Concluding Remarks 

In this paper, we have presented a framework for random case analysis of
inductive learning algorithms. Our framework unifies the formal mathematical and
the empirical approaches to understanding the behavior of inductive learning
algorithms. Random case analysis can explain empirical observations about the
behavior of various algorithms, make predictions, and guide the development of
new learning algorithms. We have applied the framework to four different
algorithms. We have verified through experimentation that the analysis accurately
predicts the expected performance accuracy.

The framework requires much more information about the training examples
than both the average case model and the PAC model. In particular, although
average case analysis can obtain the precise theoretical value if the average
case model is correct, our framework is a randomized approximation algorithm
for analyzing learning algorithms. That is, in our framework, we must carefully
determine the values of $F^+$, $F^-$ and $\delta$ so as to calculate the number of trials.
Furthermore, since the inequalities used to calculate the number of trials, (12) and (17),
are almost proportional to $1/\delta^2$, we must consider the tradeoff between the
preciseness of the analysis and the computational time.

Abstract. Within the present paper, we investigate the principal
learning capabilities of iterative learners in some more detail. The general
scenario of iterative learning is as follows. An iterative learner
successively takes as input one element of a text (an informant) of a target
concept as well as its previously made hypothesis, and outputs a new
hypothesis about the target concept. The sequence of hypotheses has to
converge to a hypothesis correctly describing the target concept.

We study the following variants of this basic scenario. First, we consider
the case that an iterative learner has to learn on redundant texts or
informants only. A text (an informant) is redundant if it contains every
data item infinitely many times. This approach guarantees that relevant
information is, in principle, accessible at any time in the learning process.
Second, we study a version of iterative learning where an iterative learner
is supposed to learn independently of the choice of the initial hypothesis. In
contrast, in the basic scenario of iterative inference, it is assumed that
the initial hypothesis is the same for every learning task, which allows
certain coding tricks.

We compare the learning capabilities of all models of iterative learning 
from text and informant, respectively, to one another as well as to finite 
inference, conservative identification, and learning in the limit from text 
and informant, respectively. 



1 Introduction 

Induction constitutes an important feature of learning. The corresponding theory 
is called inductive inference. Inductive inference may be characterized as the 
study of systems that map evidence on a target concept into hypotheses about it. 
The investigation of scenarios in which the sequence of hypotheses stabilizes to an 
accurate and finite description of the target concept is of some particular interest. 
The precise definitions of the notions evidence, stabilization, and accuracy go 
back to Gold [3] who introduced the model of learning in the limit. 

The general situation investigated in Gold's model (cf. [3]) can be described as
follows: Given more and more information concerning the concept to be learnt,
the learning device has to produce hypotheses about the phenomenon to be
inferred. The information sequence may contain either only positive data, i.e., exactly
all elements contained in the concept to be recognized, or both positive
and negative data, i.e., all elements of the underlying learning domain,
classified with respect to their containment in the unknown concept. These
information sequences are called text and informant, respectively. The sequence
of hypotheses has to converge to a hypothesis correctly describing the object to
be learnt. Consequently, the inference process is an ongoing one.

However, Gold's model (cf. [3]) makes the unrealistic assumption that the
learner has access to the whole initial segment of the information sequence pro- 
vided so far. If huge data sets are around, no learning algorithm can use all the 
data or even large portions of it simultaneously for computing hypotheses about 
concepts represented by the data. Since each practical learning system has to deal 
with the limitations of space, variants of the general approach described above 
restricting the accessibility of input data have been discussed (cf., e.g., [14,9]). 
An intensively studied example is iterative learning. Here, the learning device 
(henceforth called iterative learner) is required to produce its actual hypothe- 
sis exclusively from its previous one and the next element in the information 
sequence. 

Within the present paper, we investigate the principal learning capabilities of
iterative learners in some more detail. We confine ourselves to studying
the learnability of indexable concept classes (cf. [1,15]). The motivation of
our study is based on the rather simple observation that there is no learning
per se. Learning is embedded into scenarios of a more comprehensive usage.
Such an environment usually puts constraints on the way information is
accessible, on the requirements hypotheses have to meet, and so on.

For illustration, consider the following scenario which is typical for several
approaches to case-based reasoning (cf. [6]). A given case-based reasoning system
is in use, i.e., some user repeatedly puts in query cases and receives as
the system's response proposals for how to proceed with the query cases. If the
proposals are satisfactory, nothing has to be changed. If the outputs do not meet
the user's expectations or the environmental needs, (s)he is requested to provide
data illustrating the misbehaviour. Based on this information, the system is
supposed to change its state, and thereby to modify its behaviour appropriately.
Thus, learning, in particular some kind of iterative learning, takes place. Learning
succeeds if the initial state is successfully transferred into a goal state (which
meets all the user's expectations) by processing only finitely many pieces of information.

In order to gain a better understanding of the principal learning capabilities
of those case-based reasoning systems, it seems reasonable to consider them
as a certain kind of iterative learner. Since the basic model of iterative learning
does not reflect all their specifics very well, some modifications are in order.

First, in the learning scenario discussed above, it is highly desirable that
every possible initial state of the system can be transformed into a goal state.
The initial state of the system reflects a treasure of experiences that have proved
their usefulness in the past; so it is justified to keep them if possible. Our notion
of iterative learning with variable initial hypotheses (cf. Definition 4) reflects this
intention. In contrast, in the basic model of iterative inference (cf. Definition 3),
it is assumed that an iterative learner starts with an a priori fixed hypothesis.
The initial hypothesis is the same for all learning tasks, so it does not carry any
message. Note that this approach has some benefits, too. It gives the learner
the freedom to code, up to a certain extent, information about the progress
made in the actual learning task directly into intermediate hypotheses. In the
modified model, such coding is meaningless, since one has to be aware that every
intermediate hypothesis serves as initial hypothesis of different learning tasks,
as well.

Second, in the basic model of iterative learning, a learner is supposed to 
learn on every possible information sequence. Thus, it may happen that relevant 
data items occur only once in the given information sequence. This may lead 
to situations, in which relevant data items will be overlooked, since they appear 
at the wrong time, and therefore learning may fail. However, this contradicts 
daily life experiences: if information is really important, it will not be presented 
only once. Redundant information sequences have the property that every data 
item appears infinitely many times, and therefore relevant data are, in principle, 
accessible at any time in the learning process. The corresponding learning model 
is called iterative learning from redundant texts and informants, respectively 
(cf. Definition 5). 

As we will see, iterative learners that are supposed to learn from redundant 
information sequences, only, are much more powerful than those that have to be 
successful on every text and informant, respectively. When learning from posi- 
tive data is concerned, iterative learning from redundant information sequences 
is exactly as powerful as conservative inference which itself is less powerful than 
learning in the limit (cf. Corollary 3 and Proposition 1). Interestingly, iterative 
learning from redundant informants is exactly as powerful as learning in the 
limit from positive and negative data (cf. Corollary 11). Consequently, if exclu- 
sively redundant information sequences have to be processed, it is justified to 
use iterative learners instead of unconstrained ones. 

Surprisingly, even iterative learning with variable initial hypotheses from re- 
dundant informants turns out to be of the same learning power as learning in 
the limit (cf. Corollary 11). When learning from positive data is concerned, the 
situation changes. There are concept classes iteratively learnable from arbitrary 
texts that cannot be iteratively learnt with variable initial hypotheses even in 
case that exclusively redundant texts have to be processed (cf. Theorem 5). 

As one may expect, the power of iterative learning with variable initial hy- 
potheses from arbitrary texts and informants, respectively, is rather limited. In 
both cases, the corresponding learning model is incomparable to finite learning 
which itself is known to be very restrictive (cf. Theorems 8 and 15). 

2 Preliminaries 

$\mathbb{N} = \{0, 1, 2, \ldots\}$ is the set of all natural numbers. We set $\mathbb{N}^+ = \mathbb{N} \setminus \{0\}$. By
$\langle \cdot, \cdot \rangle : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ we denote Cantor's pairing function. We write $A \,\#\, B$ to
indicate that the sets $A$ and $B$ are incomparable, i.e., $A \setminus B \neq \emptyset$ and $B \setminus A \neq \emptyset$.

Any recursively enumerable set $X$ is called a learning domain. By $\wp(X)$ we
denote the power set of $X$. Let $\mathcal{C} \subseteq \wp(X)$, and let $c \in \mathcal{C}$; then we refer to $\mathcal{C}$ and $c$
as a concept class and a concept, respectively. A concept class $\mathcal{C}$ is said to be
inclusion-free iff $c \not\subset c'$ for all concepts $c, c' \in \mathcal{C}$. A concept class is said to be
superfinite iff it contains all finite concepts over the learning domain $X$ and at
least one infinite concept.

In the sequel we deal with the learnability of indexable concept classes with
uniformly decidable membership defined as follows (cf. [1]). A class of non-empty
concepts $\mathcal{C}$ is said to be an indexable class with uniformly decidable membership
provided there are an effective enumeration $(c_j)_{j \in \mathbb{N}}$ of all and only the concepts
in $\mathcal{C}$ and a recursive function $f$ such that for all $j \in \mathbb{N}$ and all $x \in X$ we have:

$$f(j, x) = \begin{cases} 1, & \text{if } x \in c_j, \\ 0, & \text{otherwise.} \end{cases}$$

In the following we refer to indexable classes with uniformly decidable
membership as indexable classes, for short.
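
As a small illustration of uniformly decidable membership, the following Python sketch gives a toy indexable class over the learning domain $X = \mathbb{N}$. The class itself, $c_j$ = the set of all multiples of $j + 1$, is our own hypothetical example and is not taken from the results below; `concept_prefix` is a helper introduced only for this sketch.

```python
def f(j: int, x: int) -> int:
    """Uniform membership test: returns 1 if x is in c_j, and 0 otherwise.

    Here c_j is the (non-empty) set of all multiples of j + 1, so a single
    total computable function decides membership for every index j.
    """
    return 1 if x % (j + 1) == 0 else 0


def concept_prefix(j: int, bound: int) -> list[int]:
    """List the elements of c_j that are smaller than bound, using only f."""
    return [x for x in range(bound) if f(j, x) == 1]


if __name__ == "__main__":
    print(concept_prefix(1, 7))  # members of c_1 below 7
    print(f(2, 9), f(2, 10))     # membership of 9 and 10 in c_2
```

The point of the definition is that one procedure works for all indices at once; a learner can therefore test any hypothesis $c_j$ against any data item.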

Next, we describe some well-known examples of indexable classes. Let $\Sigma$
denote any fixed finite alphabet of symbols, and let $\Sigma^*$ be the free monoid
over $\Sigma$. Then $X = \Sigma^*$ as well as $X = \Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$ (where $\varepsilon$ is the empty
string) serve as learning domains. As usual, we refer to subsets $L \subseteq \Sigma^*$ as
languages (instead of concepts). Then, the set of all context sensitive languages,
context free languages, regular languages, and pattern languages, respectively,
form indexable classes (cf. [1,4]).

Let We consider X = IJn>i learning domain. Then, the set of all 

concepts expressible as monomial. 



2.1 Gold-Style Learning from Positive Data 

Let $X$ be the underlying learning domain, let $c \subseteq X$ be a concept, and let
$t = (x_n)_{n \in \mathbb{N}}$ be an infinite sequence of elements from $c$ such that $content(t) =
\{x_n \mid n \in \mathbb{N}\} = c$. Then $t$ is said to be a positive presentation or, synonymously,
a text for $c$. By $text(c)$ we denote the set of all texts for $c$. A text $t$ is said to
be redundant provided that every element from $c$ appears infinitely many times,
i.e., for all $x \in c$, there are infinitely many $n \in \mathbb{N}$ with $x_n = x$. By $text^r(c)$
we denote the set of all redundant texts for $c$. Moreover, let $t$ be a text, and
let $y$ be a number. Then, $t_y$ denotes the initial segment of $t$ of length $y + 1$,
and $t_y^+ = \{x_n \mid n \leq y\}$. Furthermore, let $\sigma = x_0, \ldots, x_n$ be any finite sequence.
Then we use $|\sigma|$ to denote the length of $\sigma$. Additionally, by $\sigma \diamond \tau$ we denote the
concatenation of two finite sequences $\sigma$ and $\tau$.
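
The notions above can be made concrete with a short Python sketch. It only covers finite, non-empty concepts, for which cycling through the elements yields a redundant text; for infinite concepts a dovetailing enumeration would be needed instead. All function names are hypothetical.

```python
from itertools import count, islice


def redundant_text(c):
    """Generate a redundant text for a finite, non-empty concept c.

    Cycling through the sorted elements forever makes every element of c
    appear infinitely many times.
    """
    elems = sorted(c)
    for n in count():
        yield elems[n % len(elems)]


def initial_segment(t, y):
    """t_y: the initial segment of the text t of length y + 1."""
    return list(islice(t, y + 1))


def content(segment):
    """t_y^+: the set of elements occurring in a finite segment."""
    return set(segment)


def concat(sigma, tau):
    """Concatenation of two finite sequences sigma and tau."""
    return list(sigma) + list(tau)


if __name__ == "__main__":
    seg = initial_segment(redundant_text({3, 1, 2}), 4)
    print(seg, content(seg), len(concat(seg, [7])))
```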

As in [3] we define an inductive inference machine (abbr. IIM) to be an
algorithmic device working as follows: The IIM takes as its input larger and larger
initial segments of a positive presentation. After processing an initial segment,
the IIM outputs a hypothesis, i.e., a number encoding a certain computer
program. More formally, an IIM maps finite sequences of elements from $X$ into
numbers in $\mathbb{N}$.

The numbers output by an IIM are interpreted with respect to a suitably
chosen hypothesis space $\mathcal{H}$. Since we exclusively deal with indexable classes $\mathcal{C}$,
we always take as a hypothesis space an indexable class $\mathcal{H} = (h_j)_{j \in \mathbb{N}}$. The
indices are regarded as suitable finite encodings of the concepts described by the
hypotheses. When an IIM outputs a number $j$, we interpret it to mean that the
machine is hypothesizing $h_j$. Clearly, $\mathcal{H}$ must be defined over the same learning
domain $X$ over which $\mathcal{C}$ is defined, and, moreover, $\mathcal{H}$ must comprise the target
concept class $\mathcal{C}$.

Let $t$ be a positive presentation, and let $y \in \mathbb{N}$. Then we use $M(t_y)$ to denote
the hypothesis produced by $M$ when fed the initial segment $t_y$. The sequence
$(M(t_y))_{y \in \mathbb{N}}$ is said to converge to the number $j$ iff all but finitely many terms
of the sequence are equal to $j$.

Now, we define some models of learning. We start with learning in the limit.

Definition 1 ([3]). Let $\mathcal{C}$ be an indexable class, let $c$ be a concept, and let
$\mathcal{H} = (h_j)_{j \in \mathbb{N}}$ be a hypothesis space. An IIM $M$ $LimTxt_{\mathcal{H}}$-identifies $c$ iff, for
every $t \in text(c)$, there exists a $j \in \mathbb{N}$ such that the sequence $(M(t_y))_{y \in \mathbb{N}}$
converges to $j$ and $c = h_j$.

$M$ $LimTxt_{\mathcal{H}}$-identifies $\mathcal{C}$ iff, for each $c \in \mathcal{C}$, $M$ $LimTxt_{\mathcal{H}}$-identifies $c$.

Finally, let $LimTxt$ denote the collection of all indexable classes $\mathcal{C}$ for which
there are an IIM $M$ and a hypothesis space $\mathcal{H}$ such that $M$ $LimTxt_{\mathcal{H}}$-identifies $\mathcal{C}$.

In the above definition $Lim$ stands for "limit". Suppose an IIM identifies
some concept $c$. That means, after having seen only finitely many data of $c$ the
IIM reaches its (unknown) point of convergence and it computes a correct and 
finite description of the target concept. Hence, some form of learning must have 
taken place. 

In general, it is not decidable whether or not an IIM M has already converged 
on a text t for the target concept c. Adding this requirement to Definition 1 
results in finite learning (cf. [3]). The corresponding learning type is denoted by 
FinTxt. 

Now, we define conservative IIMs. Intuitively, conservative IIMs maintain
their actual hypothesis at least as long as they have not received data that "provably
misclassify" it.

Definition 2 ([1]). Let $\mathcal{C}$ be an indexable class, let $c$ be a concept, and let
$\mathcal{H} = (h_j)_{j \in \mathbb{N}}$ be a hypothesis space. An IIM $M$ $ConsvTxt_{\mathcal{H}}$-identifies $c$ iff $M$
$LimTxt_{\mathcal{H}}$-identifies $c$, and, for every $t \in text(c)$ and for all $y \in \mathbb{N}$, if $M(t_y) \neq
M(t_{y+1})$ then $t_{y+1}^+ \not\subseteq h_{M(t_y)}$.

$M$ $ConsvTxt_{\mathcal{H}}$-identifies $\mathcal{C}$ iff, for each $c \in \mathcal{C}$, $M$ $ConsvTxt_{\mathcal{H}}$-identifies $c$.

The learning type $ConsvTxt$ is analogously defined as above.

As it turned out, for proving some of the stated results, it is conceptually
simpler to use the characterization of conservative learning equating it with
set-driven inference (cf. [10]). Set-drivenness has been introduced in [13] and
describes the requirement that the output of an IIM is only allowed to depend
on the range of its input. More formally, an IIM $M$ is said to be set-driven with
respect to $\mathcal{C}$ iff $M(t_y) = M(t'_{y'})$ for all $y, y' \in \mathbb{N}$ and all texts $t, t'$ for concepts
in $\mathcal{C}$, provided $t_y^+ = {t'}_{y'}^{+}$. By $s$-$LimTxt$ we denote the collection of all indexable
classes $\mathcal{C}$ for which there are a hypothesis space $\mathcal{H}$ and a set-driven IIM $M$ that
$LimTxt_{\mathcal{H}}$-identifies $\mathcal{C}$.

2.2 Formalizing Variants of Iterative Learning from Positive Data

Looking at the above definitions, we see that an IIM $M$ always has access to
the whole history of the learning process, i.e., in order to compute its actual
guess $M$ is fed all examples seen so far. In contrast to that, next we define
iterative inductive inference machines. An iterative IIM is only allowed to use
its last guess and the next element in the positive presentation of the target
concept for computing its actual guess.

More formally, let $X$ be the underlying learning domain. Then, an
iterative IIM $M$ is an algorithmic device that maps elements from $\mathbb{N} \times X$ into $\mathbb{N}$.
Let $t = (x_n)_{n \in \mathbb{N}}$ be any text for some concept $c \subseteq X$, and let $k$ be $M$'s
initial hypothesis. Then, we denote by $(M_n(k, t))_{n \in \mathbb{N}}$ the sequence of hypotheses
generated by $M$ when successively fed $t$, i.e., $M_0(k, t) = M(k, x_0)$ and, for all
$n \in \mathbb{N}$, $M_{n+1}(k, t) = M(M_n(k, t), x_{n+1})$. Within the next definition we assume
that $M$'s initial hypothesis equals 0.
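
The recursion defining $(M_n(k, t))_{n \in \mathbb{N}}$ can be sketched in Python as follows. The toy class used here, $h_j = \{0, \ldots, j\}$ over $X = \mathbb{N}$, and the learner `max_learner` are our own hypothetical illustration; they are not constructions from this paper.

```python
def run_iterative(M, k, segment):
    """Return M_0(k, t), ..., M_y(k, t) for a finite initial segment of t.

    M maps (previous hypothesis, next data item) to a new hypothesis and has
    no access to earlier data items.
    """
    hypotheses = []
    h = k
    for x in segment:
        h = M(h, x)
        hypotheses.append(h)
    return hypotheses


def max_learner(h, x):
    """Toy iterative learner for h_j = {0, ..., j}: guess the largest item seen."""
    return max(h, x)


if __name__ == "__main__":
    segment = [1, 3, 0, 3, 2]                       # prefix of a text for {0, 1, 2, 3}
    print(run_iterative(max_learner, 0, segment))   # initial hypothesis 0
    print(run_iterative(max_learner, 7, segment))   # initial hypothesis 7
```

Started with the initial hypothesis 0, this particular learner stabilizes on the index 3 once the item 3 has been seen. Started with 7, it never leaves 7, which illustrates how strongly the behaviour of a fixed iterative learner can depend on its initial hypothesis.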

Definition 3 ([14]). Let $\mathcal{C}$ be an indexable class, let $c$ be a concept, and let
$\mathcal{H} = (h_j)_{j \in \mathbb{N}}$ be a hypothesis space. An iterative IIM $M$ $ItTxt_{\mathcal{H}}$-identifies $c$ iff,
for every $t \in text(c)$, there exists a $j \in \mathbb{N}$ such that the sequence $(M_n(0, t))_{n \in \mathbb{N}}$
converges to $j$ and $c = h_j$.

Finally, $M$ $ItTxt_{\mathcal{H}}$-identifies $\mathcal{C}$ iff, for each $c \in \mathcal{C}$, $M$ $ItTxt_{\mathcal{H}}$-identifies $c$.

The resulting learning type $ItTxt$ is analogously defined as above.

Subsequently, we use the following convention. Let $\sigma$ be any finite sequence
of elements over the relevant learning domain. Then, we denote by $M_*(k, \sigma)$ the
last hypothesis output by $M$ when successively fed $\sigma$ (as above, $k$ denotes the
initial hypothesis).

Within the following definition we consider a variant of iterative learning, 
where an iterative IIM has to learn successfully no matter which initial hypothe- 
sis has been selected. 

Definition 4. Let $\mathcal{C}$ be an indexable class, let $c$ be a concept, and let $\mathcal{H} = (h_j)_{j \in \mathbb{N}}$
be a hypothesis space. An iterative IIM $M$ $It^{v}Txt_{\mathcal{H}}$-identifies $c$ iff, for every
$t \in text(c)$ and every initial hypothesis $k \in \mathbb{N}$, there exists a $j \in \mathbb{N}$ such that
the sequence $(M_n(k, t))_{n \in \mathbb{N}}$ converges to $j$ and $c = h_j$.

Finally, $M$ $It^{v}Txt_{\mathcal{H}}$-identifies $\mathcal{C}$ iff, for each $c \in \mathcal{C}$, $M$ $It^{v}Txt_{\mathcal{H}}$-identifies $c$.

The resulting learning type $It^{v}Txt$ is analogously defined as above.

Finally, we define versions of the models of iterative learning introduced 
above, where for the iterative IIMs it is sufficient to learn from redundant texts, 
only. More formally: 

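Definition 5. Let $\mathcal{C}$ be an indexable class, let $c$ be a concept, and let $\mathcal{H} = (h_j)_{j \in \mathbb{N}}$
be a hypothesis space. An iterative IIM $M$ $ItTxt^{r}_{\mathcal{H}}$ [$It^{v}Txt^{r}_{\mathcal{H}}$]-identifies $c$ iff, for
every redundant text $t \in text^r(c)$ [and every initial hypothesis $k \in \mathbb{N}$], there
exists a $j \in \mathbb{N}$ such that the sequence $(M_n(0, t))_{n \in \mathbb{N}}$ [$(M_n(k, t))_{n \in \mathbb{N}}$] converges
to $j$ and $c = h_j$.

Finally, $M$ $ItTxt^{r}_{\mathcal{H}}$ [$It^{v}Txt^{r}_{\mathcal{H}}$]-identifies $\mathcal{C}$ iff, for each $c \in \mathcal{C}$, $M$ $ItTxt^{r}_{\mathcal{H}}$
[$It^{v}Txt^{r}_{\mathcal{H}}$]-identifies $c$.

The learning types $ItTxt^{r}$ and $It^{v}Txt^{r}$ are analogously defined as above.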






Steffen Lange and Gunter Grieser 



3 Iterative Learning from Positive Data 



In this section, we compare the learning capabilities of all models of iterative 
learning from positive data to one another as well as to finite inference, learning 
in the limit and conservative identification from text. 

First, we summarize the previously known results (cf. [7,8,9,10]).

Proposition 1. $FinTxt \subset ItTxt \subset ConsvTxt = s\text{-}LimTxt \subset LimTxt$.

In case that it is guaranteed that the relevant information appears infinitely
many times in a text, any conservative learner can be simulated by an iterative
IIM that has the same learning power. More formally:

Theorem 1. $ConsvTxt \subseteq ItTxt^{r}$.

Proof. Let X be the relevant learning domain over which C is defined. Assume 
C £ ConsvTxt. Applying the characterization of ConsvTxt from [8] , we know that 
there are a hypothesis space Ti — {hj)j^}N and a computable function T that 
assigns a finite telltale set T) to every hypothesis hj. More formally, on every 
input j £ IN , T enumerates a finite set Tj and stops. Furthermore, Tj has the 
following features: (1) Tj C hj and (2), for all k £ m,T, C hk implies hk hj. 

Let T = (Tj)jgiv denote any repetition free enumeration of all finite subsets 
of X, where Tq = 0- Furthermore, we assume an effective procedure comput- 
ing, for every finite set PCX, its uniquely determined index ff{F) in T . 
In order to show that C £ ItTxP, we select the following hypothesis space 
^ = {h{j,k,i))j,k,ieiN- For all j,k,l £ IN, let h^o,k,i) = Fk and h/^j+i^k.l) = hj. 

The desired iterative IIM $M$ is defined as follows. Suppose any $\langle j,k,l \rangle \in \mathbb{N}$
and any $x \in X$. Note that, by definition of Cantor's pairing function, $\langle 0,0,0 \rangle = 0$,
which is, by definition, $M$'s initial hypothesis.
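The remark about the initial hypothesis can be checked with a small sketch of Cantor's pairing function. The extension to triples by nesting (`triple`) is an assumption made for illustration; the text does not say which nesting order is used, but any nesting of the standard pairing function maps the all-zero triple to 0.

```python
def pair(x: int, y: int) -> int:
    """Cantor's pairing function, a bijection from N x N onto N."""
    return (x + y) * (x + y + 1) // 2 + y


def triple(j: int, k: int, l: int) -> int:
    """Encode <j, k, l> by nesting pair(); the nesting order is an assumption."""
    return pair(j, pair(k, l))


# The all-zero triple is encoded as 0, the learner's initial hypothesis.
print(triple(0, 0, 0))
```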

IIM $M$: “On input $\langle j,k,l \rangle$ and $x$ do the following:

Determine $F' = F_k \cup \{x\}$.

If $j = 0$ or $x \notin h_{j-1}$, check, for all $z = 0, \ldots, l+1$, whether or not
$T_z \subseteq F' \subseteq h_z$. If there is a $z$ passing this test, fix the least one, say $z$, and
output the hypothesis $\langle z+1, \#(F'), l+1 \rangle$. Otherwise, output the hypothesis
$\langle 0, \#(F'), l+1 \rangle$.

If $x \in h_{j-1}$, test, for all $z = 0, \ldots, l$, whether or not $x \in T_z$. If such a $z$ is
found, output the hypothesis $\langle j, \#(F'), l \rangle$. Otherwise, output the hypothesis

Due to space limitations, a verification of $M$'s correctness is omitted. □

Interestingly, iterative IIMs cannot outperform conservative learners, even if 
the iterative IIM has to learn from redundant texts, only. Thus, the weakness of 
iterative learners (compared to the capabilities of unconstrained IIMs) cannot 
be compensated, even if the relevant data appear infinitely many times in the 
texts. 

In order to elaborate the result mentioned above we heavily exploit the fact
that conservative learners are exactly as powerful as set-driven IIMs (cf.
Proposition 1). Thereby, we adapt an idea from [5] and [9].

Theorem 2. ItTxP C s-LimTxt. 
Proof. Let $X$ be the relevant learning domain over which $C$ is defined, and
assume C € ItTxP . Then there are an IIM M and a hypothesis space T~L = 
{hj)jeiN such that M /tTit^-identifies C. For proving C e s-LimTxt, first we 
construct a suitable hypothesis space $\hat{\mathcal{H}} = (\hat{h}_j)_{j \in \mathbb{N}}$. Let $\mathcal{F} = (F_j)_{j \in \mathbb{N}}$ and
$\#(F)$ be defined as in the demonstration of Theorem 1 above. Then, we define
$\hat{h}_{2j} = h_j$ and $\hat{h}_{2j+1} = F_j$ for every $j \in \mathbb{N}$. For every non-empty finite set $T \subseteq X$,
we define $ref(T) = x_0, x_1, \ldots, x_{card(T)-1}$ to be the repetition-free enumeration
of all the elements of $T$ in lexicographical order. Furthermore, we set
$exh(T) = x_0 \circ x_0, x_1 \circ x_0, x_1, x_2 \circ \cdots \circ x_0, x_1, \ldots, x_{card(T)-1}$.
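As a small illustration of the two sequences just defined, the following sketch computes $ref(T)$ and $exh(T)$ for a finite set. It assumes that Python's built-in ordering on the elements plays the role of the lexicographical order; the function names simply mirror the notation of the text.

```python
def ref(T):
    """Repetition-free enumeration of T in sorted (assumed lexicographical) order."""
    return sorted(T)


def exh(T):
    """Concatenate all non-empty prefixes of ref(T): x0, then x0 x1, then x0 x1 x2, ..."""
    r = ref(T)
    out = []
    for i in range(1, len(r) + 1):
        out.extend(r[:i])
    return out


print(exh({"b", "a", "c"}))  # ['a', 'a', 'b', 'a', 'b', 'c']
```

Every element of $T$ occurs in $exh(T)$, and each prefix of $ref(T)$ appears as a block, which is what allows a set-driven learner to replay the same sequence for a given set regardless of the order in which the data arrived.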

The desired set-driven IIM $M'$ takes as its inputs finite sets $T$, and is defined
as follows:

IIM $M'$: “On input $T$ do the following:

Determine $exh(T)$. Check, for all $x \in T$, whether or not $M_*(0, exh(T)) =
M_*(0, exh(T) \circ x)$.

If it is, output $2 \cdot M_*(0, exh(T))$, and request the next input. Otherwise, output
$2 \cdot \#(T) + 1$, and request the next input.”

By definition, $M'$ is set-driven. For showing that $M'$ s-LimTxt-infers $C$, let $c \in C$,

and let $t \in text(c)$. We distinguish the following cases.

Case 1. c is finite. 

Then, there exists axi n G IN with = c. It suffices to show that c = 

If M{c) = 2 ■ #(c) + 1, we are done, by construction. Otherwise, for all x G c, 
we have M*(0, ex/i(c)) = M*(0, exli(c) ox). Let j — M*(0, ex/i(c)). Hence, M 
converges to j when fed the redundant text exh{c) o re/(c) o ref{c) o ref{c) o ■ ■ ■ 
from texF(c). Since M learns c, we are done. 

Case 2. $c$ is infinite.

Let F ~ {xj)j^iN be the lexicographically ordered positive presentation for c. 
Thus, = xqOxo, XiOxo, Xi, X 20 - ■ ■ is a redundant text from texF{c). Since M 
/fTxf”-learns c from there exists an uq G IN such that M*(0, = 

for all n > riQ. By the choice of we immediately obtain that 
M,(0, ox) = M»(0, for all x G c. Since M learns c from F^^ we have 
c = hj = fi 2 j. Let (7 = ^no+i' Finally, since t G text{c), there exists an index mo 
such that cr+ C t+^. Thus, cr is a prehx of exh{t^^), and hence — 2j for 

all m > mo. I 

Furthermore, taking into consideration that $ItTxt \subset ConsvTxt$ (cf. [9]), we
may easily conclude: 

Corollary 3. ItTxt C ItTxF = ConsvTxt. 

Next, we show that intermediate hypotheses have to be used to reflect the 
progress made in the learning process. Without this option, iterative learners are 
unable to exploit the whole extra information that is provided within redundant 
texts. In order to achieve the announced insight, we start with a theorem that 
illuminates the structural properties of IF Txf”-learnable concept classes. 
Theorem 4. Let C be an indexable class. C G IF TxF iff C is inclusion-free. 
Proof. First, suppose that an inclusion-free indexable class $C = (c_j)_{j \in \mathbb{N}}$ is
given. Select the hypothesis space H = that meets, for all j,n S 

IN, = Cj. Then, the following IIM M /t’' Tirt^-identifies C. For all fc e W 

and all possible input data $x$, let $M(k,x) = \min\{j \mid j \ge k,\ x \in h_j\}$. We omit
the details.
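The learner $M(k,x) = \min\{j \mid j \ge k,\ x \in h_j\}$ can be tried out on a small inclusion-free class. The sketch below is illustrative only: it assumes a finite class of `n` concepts and lets hypothesis `j` denote concept `j mod n`, which is one simple way to make every concept occur infinitely often in the hypothesis space. The helper names `make_learner` and `run` are hypothetical.

```python
from typing import Callable, FrozenSet, Hashable, List, Tuple

Concept = FrozenSet[Hashable]


def make_learner(concepts: List[Concept]) -> Tuple[Callable[[int, Hashable], int], Callable[[int], Concept]]:
    n = len(concepts)

    def h(j: int) -> Concept:
        # Assumption: hypothesis j denotes concept j mod n, so each concept recurs forever.
        return concepts[j % n]

    def M(k: int, x: Hashable) -> int:
        # Least j >= k whose concept contains x; one full cycle of indices suffices.
        for j in range(k, k + n):
            if x in h(j):
                return j
        raise ValueError("x belongs to no concept of the class")

    return M, h


def run(M, k, text):
    """Feed the text element by element, starting from hypothesis k."""
    for x in text:
        k = M(k, x)
    return k


concepts = [frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})]  # inclusion-free
M, h = make_learner(concepts)
text = [2, 3] * 4  # finite prefix of a redundant text for {2, 3}
for k0 in (0, 5):
    k = run(M, k0, text)
    print(k0, "->", k, sorted(h(k)))
```

Whatever the initial hypothesis, the learner only moves forward, and on an inclusion-free class it can stop only at a hypothesis that contains every element of the target, which then must be the target itself.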

Next we show that If" Txt ’’-identifiable classes must be inclusion-free. To see 
this assume, for a moment, that there is an indexable class C G It" Txt" that is 
not inclusion-free. Hence, there are an iterative IIM M and a hypothesis space 
7i such that M It" Txt^-identifies C. Let c,c' E C with c' C c, let k be some 
final hypothesis^ of M for c, and let t and t' be any redundant text for c and c', 
respectively. Since c' E c and due to choice of the initial hypothesis k, we know 
that, for all n E IN, M,f{k,tn) = M^{k,t'.jf) = k, and therefore M fails to learn 
at least one of both concepts. I 

Theorem 5. ItTxt ff It" Txt" . 

Proof. Consider the class of all finite concepts over the learning domain X. 
Clearly, C/;„ G ItTxt, but Cfm is not inclusion-free, and therefore Cf^ ^ It" Txt" , 
by Theorem 4. On the other hand, let C be the class of all concepts Cj over the 
learning domain X = {a}^ with Cj = {a}+ \ Clearly, C is inclusion-free, 

and, by Theorem 4, C G It" TxP . Since C ft ItTxt (cf. [9]), we are done. I 

Furthermore, since It" Txt C It" Txt" and It" Txt C ItTxt, we obtain: 
Corollary 6. 

(a) It" Txt E It" Txt" . 

(b) It" Txt E ItTxt. 

Our next result puts the weakness of It" Txt into the right perspective. 
Theorem 7. FinTxt \ It" Txt ^ 0. 

Proof. Let C be the indexable class that contains exactly all c C {a}"*" with 
card[c) — 2. Obviously, C G Fin, Txt. On the other hand, one easily verifies that 
even the finite subclass C that contains the concepts {a, a^}, {a, a^}, and {a^, a^} 
does not belong to It" Txt. To see this suppose that there are an iterative IIM M 
and a hypothesis space Tt such that M It" Tx'f— idcntihcs C . Let k be some final 
hypothesis of M for {a^,a^}. Thus, M[k,a?‘) — M{k,af) — k. Consequently, M, 
when starting with the initial hypothesis k, outputs exactly the same sequence of 
hypotheses when fed the texts t = a? ,a,a, . . . for {a, a^} and t' = ax' , a, a, .. . for 
{a, a^}. Thus, M must fail to learn at least one of both concepts, a contradiction. 

I 

However, It" Txt may outperform FinTxt, as well. 

Theorem 8. FinTxt ff It" Txt. 

Proof. By Theorem 7, it remains to show that It" Txt \ FinTxt ^ 0. Let 
{Pj)j€lN denote any fixed programming system of all (and only all) partial re- 
cursive functions over IN, and let be any associated complexity mea- 

sure (cf. [2]). Let C be the indexable class that contains all concepts C 2 j = 
and C 2 j+i = Note that C 2 j = C 2 j+i = 

{a^b} in case that is undefined. One easily verifies that C G It" Txt. On the 

^ fc is said to be a final hypothesis of M for c provided that M[k,x) = k for all x E c. 

Note that k must exist, since M, in particular. It” Tat^-identifies c. 
other hand, an IIM M that finitely learns C can be used to decide the halting
problem. On input j G IN, one has to check in parallel, for all z = 0,1,..., 
whether < z or M, when processing the initial segment tz of the text 

t = a^b,a^b, . . outputs its final and correct hypothesis. One easily verifies that 
in the latter case ‘Pj(j) must be undefined, since, otherwise, M cannot finitely 
learn both, C 2 j and C 2 j+i- I 

It is quite obvious that FinTxt contains exclusively indexable concept classes 
that are inclusion-free. Hence, we may conclude: 

Corollary 9. FinTxt C It'’ TxN . 

The following picture displays the achieved separations and coincidences of
the considered learning types. Each learning type is represented as a vertex in a
directed graph. A directed edge (or path) from vertex A to vertex B indicates that
A is a proper subset of B, and no edge (or path) between two vertices implies
that A and B are incomparable.

[Figure: directed graph over the learning types LimTxt, ConsvTxt, s-LimTxt, FinTxt, ItTxt, and the variants of ItTxt studied in this section.]



4 Iterative Learning from Positive and Negative Data 

Next, we study iterative learning from positive and negative data. Thus, we have 
to introduce some more notations and definitions. 

Let $X$ be the underlying learning domain, let $c \subseteq X$ be a concept, and
let $i = ((x_n,b_n))_{n \in \mathbb{N}}$ be any sequence of elements of $X \times \{+,-\}$ such that
$content(i) = \{x_n \mid n \in \mathbb{N}\} = X$, $i^+ = \{x_n \mid n \in \mathbb{N},\ b_n = +\} = c$ and
$i^- = \{x_n \mid n \in \mathbb{N},\ b_n = -\} = X \setminus c = \bar{c}$. Then, we refer to $i$ as an informant. By
info{c) we denote the set of all informants for c, and by info”{c) the set of all 
redundant informants for c, i.e., informants having the property that, for all 
X E X, there are infinitely many n E IN with = x. We use iy to denote 
the initial segment of i of length y + 1, and define iy = {xn \ n < y, bn = +} 
and iy = {xn \n<y,bn = -}. 

Furthermore, let $c \subseteq X$, and let $(x,b) \in X \times \{+,-\}$. Then, $c$ is said to be
consistent with $(x,b)$ (abbr. $cons(c,(x,b))$) provided that $x \in c$, if $b = +$, and
$x \notin c$, otherwise.

The learning models LimInf and FinInf are defined analogously to their text
counterparts by replacing text by informant. Finally, we extend the definitions
of all variants of iterative learning in the same way, and denote the resulting 
learning types by It Inf , It” Inf , Itlnf”, and It”Inf”, respectively. 

As in the previous section, we first summarize the known results (cf. [7]). 
Proposition 2. $FinInf \subset ItInf \subset LimInf$.

In contrast to the text case, iterative learning from redundant positive and 
negative data is at least as powerful as learning in the limit from informant. This 
add-on in learning power can also be observed, if iterative learners have to be 
successful no matter which initial hypothesis has been selected. 

Theorem 10. Let C be an indexable class. Then, C G IV’Inf”. 
Proof. Let $C = (c_j)_{j \in \mathbb{N}}$ be an indexable concept class. Select the hypothesis
space $\mathcal{H} = (h_{\langle j,n \rangle})_{j,n \in \mathbb{N}}$ that meets, for all $j,n \in \mathbb{N}$, $h_{\langle j,n \rangle} = c_j$. We claim

that the following I'lM M /L"/n/5^-identifies C. For all /c £ IN, and all data 
$(x,b) \in X \times \{+,-\}$, we let $M(k,(x,b)) = \min\{j \mid j \ge k,\ cons(h_j,(x,b))\}$.

Clearly, $M$ implements the identification by enumeration principle (cf. [3]),
and $M$, when fed any redundant informant for some $c \in C$, converges to the least
$j \ge k$ that meets $h_j = c$, where $k$ is $M$'s initial hypothesis. □
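The enumeration learner for informants can be sketched in the same spirit. The example below is a toy illustration under stated assumptions: the class is finite, hypothesis `j` denotes concept `j mod n` (so each concept recurs infinitely often), and only a finite prefix of a redundant informant is processed. The class used is a chain, so it is not inclusion-free, yet the negative data still let the learner converge from any starting hypothesis.

```python
def cons(c, x, b):
    """c is consistent with (x, b): x in c if b is '+', x not in c otherwise."""
    return (x in c) == (b == "+")


def make_learner(concepts):
    n = len(concepts)

    def h(j):
        # Assumption: hypothesis j denotes concept j mod n.
        return concepts[j % n]

    def M(k, x, b):
        # Least j >= k whose concept is consistent with the datum (x, b).
        for j in range(k, k + n):
            if cons(h(j), x, b):
                return j
        raise ValueError("no concept of the class is consistent with the datum")

    return M, h


def run(M, k, informant):
    for x, b in informant:
        k = M(k, x, b)
    return k


concepts = [frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3})]  # a chain
M, h = make_learner(concepts)
informant = [(1, "+"), (2, "+"), (3, "-")] * 3  # prefix of a redundant informant for {1, 2}
for k0 in (0, 2):
    k = run(M, k0, informant)
    print(k0, "->", k, sorted(h(k)))
```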

Finally, since, by definition, It^Inf'" C and since every indexable con- 

cept class is Liminf -identi&ahle (cf. [3]), we can conclude: 

Corollary 11. It^Inf’" = Itlnf'~ = Liminf. 

The picture changes drastically, if iterative learning from non-redundant in- 
formants is considered. However, in contrast to the text case, If" Inf contains 
relatively rich concept classes. 

Observation 12. Cfin £ It" Inf . 

Proof. As in the proof of Theorem 1, let $\mathcal{F} = (F_j)_{j \in \mathbb{N}}$ denote any repetition-free
enumeration of all finite subsets of the learning domain $X$ and assume
any effective procedure computing, for every finite set $F \subseteq X$, its uniquely
determined index $\#(F)$ in $\mathcal{F}$. We choose $\mathcal{F}$ as hypothesis space and define the
needed iterative learner $M$ as follows. Let $k \in \mathbb{N}$ and $(x,b)$ be given. Then,
we let $M(k,(x,b)) = \#(F_k \cup \{x\})$, if $b = +$, and $M(k,(x,b)) = \#(F_k \setminus \{x\})$, if
$b = -$. Further details are omitted. □
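A minimal sketch of this learner follows. It assumes that the elements of the learning domain are coded as natural numbers and uses the binary encoding of a finite set as its index, which is one concrete repetition-free enumeration with $F_0 = \emptyset$; the names `index_of` and `set_of` are hypothetical stand-ins for $\#(F)$ and $F_k$.

```python
def index_of(F):
    """#(F): index of the finite set F, here its bitmask (element x sets bit x)."""
    return sum(1 << x for x in F)


def set_of(k):
    """F_k: the finite set with index k."""
    return {i for i in range(k.bit_length()) if (k >> i) & 1}


def M(k, x, b):
    """Add x to the current set on a positive datum, remove it on a negative one."""
    F = set_of(k)
    F = F | {x} if b == "+" else F - {x}
    return index_of(F)


# Start from an arbitrary (wrong) initial hypothesis and process an informant for {1, 3}.
k = index_of({0, 2, 5})
for x, b in [(0, "-"), (1, "+"), (2, "-"), (3, "+"), (4, "-"), (5, "-")]:
    k = M(k, x, b)
print(sorted(set_of(k)))
```

Once every element of the initial guess and of the target has appeared in the informant, the hypothesis equals the target and further data leave it unchanged, independently of the starting index.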



If the target concept class contains both finite and infinite concepts, it might
be inevitable to select the initial hypothesis appropriately. To see this let Cg
be the indexable class that contains the concept $c_0 = \{a\}^+$ and all singleton
concepts $c_j = \{a^j\}$, $j \in \mathbb{N}^+$, over the learning domain $X = \{a\}^+$.

Observation 13. Cg It" Inf . 

Proof. Suppose to the contrary that there are an iterative learner M and a 
hypothesis space H — such that Af .^-identifies Cg. Since M, in 

particular, learns cq, there has to be some final hypothesis k of M for cq, i.e., 
M{k,[a"^,+)) = k for all m £ W+. Next, consider the hypotheses M^{k,in), 
n G IN , generated by M when successively processing the canonical informant 
i — (a, +), (a^, —), (a~^, —), . . . for the concept ci £ C. Since Af has to infer ci, 
there have to he j G JN and z £ that satisfy hj — ci, M^,{k,iz) — j, 

and Af (j, (a"*, — )) = j for all m > z. Finally, consider the informant i = 
(a"^, +), {of, -),..., -), -), (a, -), -), , -),... and i = 

-£), (fl2, -),..., -), {a^, -), (a, -), (a^+^ -), (a^+'^ -),... for = 

{a^} and c^+i — respectively. One immediately sees that, for all n G IN , 

Mf,{k, i„) = M^{k, ijf), and thus Af fails to infer at least one of both concepts, a 
contradiction. I 



One easily verifies that the concept class Cg is a subclass of the well-known
family of pattern languages (cf. [1]). Moreover, the proof idea presented above
can easily be adapted to show that there is no superfinite class at all that belongs 
to It" Inf . Hence, we have: 

Corollary 14. 

(a) The set of all pattern languages Cpat is not It" Inf -identifiable. 
(b) Let C be any indexable class that is superfinite. Then, C ∉ It" Inf .

Moreover, it is well-known that $C_{pat} \in FinInf$ and $C_{fin} \notin FinInf$ (cf., e.g., [15]).
Hence, we may conclude:

Theorem 15. Fminf # It" Inf . 

Since Fininf C Itinf (cf. Proposition 2) and It" Inf C Itinf, we obtain the 
missing part in the picture for the informant case. 

Corollary 16. It" Inf C Itinf. 

The following figure summarizes the established 
relations of the considered learning types for the 
informant case. The semantics is analogous to that 
of the corresponding figure in Section 3. 
Abstract. We consider exact learning in the query model. We deal
with all types of queries introduced by Angluin: membership, equiva- 
lence, superset, subset, disjointness and exhaustiveness queries, and their 
weak (or restricted) versions where no counterexample is returned. For 
each of all possible combinations of these queries, we uniformly give com- 
plete characterizations of boolean concept classes that are learnable using 
a polynomial number of polynomial sized queries. Our characterizations 
show the equivalence between the learnability of a concept class C using 
queries and the existence of a good query for any subset H of C which 
is guaranteed to reject a certain fraction of candidate concepts in H re- 
gardless of the answer. As a special case for equivalence queries alone, 
our characterizations directly correspond to the lack of the approximate 
fingerprint property, which is known to be a sufficient and necessary 
condition for the learnability using equivalence queries. 



1 Introduction 

With the remarkable advances in computer and network technology, a large
quantity of data obtained from scientific experiments is available. It is an urgent
and very important problem to establish methods to discover rules which
explain such a large quantity of data. Because the data is so large, computers
are expected to be used to analyze it. One approach is to apply a machine
learning system which learns concepts from examples, in order to discover rules
automatically. Moreover, a successful learning algorithm using queries would
give us a good strategy for conducting experiments within a reasonable amount of
time, in order to identify underlying rules. For this purpose, we have to clarify
the possibilities and limitations for computers to learn concepts from examples.

The exact learning model due to Angluin [1] is one of the most popular 
models in the field of learning theory. In this model, a learner is required to 
identify a target concept exactly using queries which give partial information 
about the target concept to the learner. Angluin introduced six kinds of queries:
membership, equivalence, superset, subset, disjointness, and exhaustiveness. In
some cases, weak (or restricted) versions of queries are used, where no
counterexample is provided to a learner.

Among these queries, membership and equivalence queries have received particular
attention, and there have been individual approaches establishing combinatorial
properties that characterize the learnability using each of the three
combinations of membership and equivalence queries. For equivalence queries
alone, Angluin [2] introduced the notion of the approximate fingerprint property as a
tool for proving non-learnability. Gavaldà [4] showed that the property (with a
slight modification) can be used to prove the converse: if a concept class does
not have an approximate fingerprint property, then the concept class is exactly
learnable using a polynomial number of polynomial sized equivalence queries.
(See also [3].) For membership queries alone, Goldman and Kearns [5] showed
that the teaching dimension gives a lower bound for the number of membership
queries required to learn. Hegedüs [6] generalized it so that the generalized
teaching dimension of a concept class is polynomial if and only if the concept
class is learnable using a polynomial number of membership queries alone. For
the combination of membership and equivalence queries, Hellerstein et al. [7] and
Hegedüs [6] independently gave an elegant combinatorial property called
polynomial certificates as a necessary and sufficient condition for polynomial-query
learning.

In this paper, we give combinatorial properties which uniformly characterize
the learnability for every possible combination of the queries introduced
above. We give the characterizations in an abstract form so that they can
easily be generalized to other kinds of queries, not specific to these six.

Our characterizations are based on the following two intuitions. The first one
is that if a learner can ask a good question to a teacher about the unknown
target concept, then the concept is easy to learn. Otherwise, it might be hard
to learn. Here, we regard a question as good if at least a certain fraction of
concepts will be rejected, no matter how the answer is returned by the teacher.
If there always exists a good question for any subset of a concept class, then the
learner can use it to reduce the hypothesis space efficiently. Otherwise, that is,
if there is no good question for some subset of a concept class, an adversary teacher
can answer maliciously so that little information will be given to the learner to
identify the target concept. In fact, this was a key idea to prove that the lack of
the approximate fingerprint property is a necessary and sufficient condition for the
learnability using equivalence queries alone [2,4].

The second intuition is that a learner can identify any target concept exactly
if and only if the learner can confirm that the hypothesis is absolutely correct
by using queries. We introduce a notion of specifying queries in order to capture
this intuition. When equivalence queries are available, it is a trivial task, since
the learner can directly confirm whether the hypothesis is correct or not.

We apply these intuitions to each type of query, and capture the essence of the
query complexity of exact learning using every possible combination of
these queries. The technicalities of the proofs may not be quite new, since they
are rather straightforward extensions of those that appeared in the literature [2,4,6]. However,
our characterizations can be applied to any kind of queries, not restricted to the
ones mentioned above. Since a query corresponds to an experiment in scientific
discovery, we hope that our characterizations will lead to an efficient strategy to
choose and perform experiments among a large number of possible experiments.



2 Preliminaries 

We adopt the terminology from [7,8]. Let $\Sigma$ be an alphabet. Then $\Sigma^*$ denotes
the set of all finite-length strings over $\Sigma$, and $\Sigma^n$ denotes the set of all strings
over $\Sigma$ of length exactly $n$. For a string $w \in \Sigma^*$, $\|w\|$ denotes the length of $w$;
for a set $S$, $|S|$ denotes the cardinality of $S$.

A representation of concepts $\mathcal{R} = (\Sigma, \Delta, R, \mu)$ is a 4-tuple where $\Sigma$ and $\Delta$ are
finite alphabets, $R$ is a subset of $\Delta^*$, and $\mu$ is a map from $R$ to subsets of $\Sigma^*$,
called concepts. $R$ is a set of representations, and $\mu$ is the map that specifies
which concept is represented by a given representation. For any concept $c$, $\chi_c$
denotes the characteristic function of $c$: for any string $w$, $\chi_c(w) = 1$ if $w$ is in $c$
and $\chi_c(w) = 0$ otherwise. The size of a concept $c$ is $\min\{\|r\| : \mu(r) = c\}$.

Throughout this paper, we assume that for any representation class $\mathcal{R}$, the
following problems are computable:

1. For a given string $r \in \Delta^*$, decide if $r \in R$.

2. For a given string $w \in \Sigma^*$ and $r \in \Delta^*$, decide if $w \in \mu(r)$.

The concept class $C$ by $\mathcal{R}$ is the set of concepts that have representations in $R$.
For any positive integer $m$, $C_m = \{\mu(r) : r \in R,\ \|r\| \le m\}$, and $C = \bigcup_{m \ge 1} C_m$.

In this paper, we deal with boolean concept classes only. Thus let us assume
that $\Sigma = \{0, 1\}$. A boolean concept $c$ is a subset of $\Sigma^n$ for some positive integer $n$.
When it causes no confusion, we will use $c$ itself to denote $\chi_c$. If $\mathcal{R}$ is a boolean
representation class, each $r \in R$ represents a boolean formula over $n$ variables,
and the concept is the set of assignments to the variables that satisfy the formula.
For any positive integers $m$ and $n$, let $C_{m,n} = \{\mu(r) : r \in R,\ \|r\| \le m \text{ and } \mu(r) \subseteq \Sigma^n\}$,
and $C_n = \bigcup_{m \ge 1} C_{m,n}$. In the sequel, we identify a concept $c \in C$ with its
representation $r \in R$ with $\mu(r) = c$ when it is clear from the context.

We assume several oracles which give some information about a target concept
$c^*$ to a learner. We may regard them as experiments to identify the target
concept. In the literature, six oracles have been introduced as follows. For each
string $v \in \Sigma^n$, the membership oracle Mem returns “Yes” if $c^*(v) = 1$ and “No”
otherwise. Moreover, for each concept $h \in C$, we define Equivalence (Equ),
Superset (Sup), Subset (Sub), Disjointness (Dis), Exhaustiveness (Exh) oracles
and their weak versions (wEqu, wSup, wSub, wDis, wExh) as in Table 1.
However, in this paper, we do not have to restrict the queries to these,
since our characterizations are not specific to these oracles. For a query $\sigma$
and a concept $c$, we denote by $c[\sigma]$ the set of possible answers for $c$ when asking
$\sigma$. We denote by $\|\sigma\|$ the length of a query $\sigma$. For example, for a membership
query $\sigma_1$, $c[\sigma_1]$ is {“Yes”} or {“No”}, and for an equivalence query $\sigma_2$, $c[\sigma_2]$ is
{“Yes”} or the set of all counterexamples.

Table 1. The definitions of oracles. The first column lists the types of oracles.
The second column gives the condition under which each oracle returns “Yes”, and the
third column the condition for “No”. The last column shows the condition which a
counterexample $w$ should satisfy. For instance, the weak equivalence oracle wEqu
answers “Yes” if $h = c^*$, and answers “No” if $h \ne c^*$. The equivalence oracle Equ
answers “Yes” if $h = c^*$, and returns a counterexample $w$ with
$w \in (h \cup c^*) - (h \cap c^*)$ if $h \ne c^*$.

Oracle       Yes                        No                           Counterexample w
Mem          $c^*(v) = 1$               $c^*(v) = 0$                 (none)
Equ, wEqu    $h = c^*$                  $h \ne c^*$                  $w \in (h \cup c^*) - (h \cap c^*)$
Sup, wSup    $h \supseteq c^*$          $h \not\supseteq c^*$        $w \in c^* - h$
Sub, wSub    $h \subseteq c^*$          $h \not\subseteq c^*$        $w \in h - c^*$
Dis, wDis    $h \cap c^* = \emptyset$   $h \cap c^* \ne \emptyset$   $w \in h \cap c^*$
Exh, wExh    $h \cup c^* = \Sigma^n$    $h \cup c^* \ne \Sigma^n$    $w \in \Sigma^n - (h \cup c^*)$

The query complexity of a learning algorithm $A$ is the sum of the lengths of the
queries it asks and of the counterexamples returned by the oracles. Note that the
length of a counterexample is always $n$, since we consider only boolean concepts.

Definition 1. Let $Q$ be a set of queries. A concept class $C$ is polynomial-query
learnable using $Q$ if there exists an algorithm $A$ and a polynomial $p(\cdot,\cdot)$ such
that, for any positive integers $m$, $n$ and any unknown target concept $c^* \in C_{m,n}$:

1. $A$ gets $n$ as input.

2. $A$ may ask queries in $Q$.

3. $A$ eventually halts and outputs $r \in R$ with $\mu(r) = c^*$.

4. The total query complexity of $A$ is at most $p(m,n)$.

In Section 3, we consider the case where the size $m$ of a target concept is
additionally given to a learner.

3 Good Queries 

We introduce a notion of good queries in order to characterize polynomial-query
learnability where the size of a target concept is known to a learner. Intuitively,
a query is good for a set $T$ of concepts if a certain fraction of $T$ is eliminated
by the query no matter how the answer is returned.

Definition 2. For a concept class $T$, a query $\sigma$ and its answer $\alpha$, we define
$Cons(T,\sigma,\alpha)$ to be the set of concepts in $T$ that are consistent with $\sigma$ and $\alpha$. That
is,

$Cons(T,\sigma,\alpha) = \{h \in T \mid \alpha \in h[\sigma]\}$.
Algorithm Learner1 ($m, n$ : positive integers)

Given $Q$ : available queries
begin
    $H := C_{m,n}$;
    while $|H| \ge 2$ do
        Find a query $\sigma \in Q$ that is $1/q(m,n)$-good for $H$; (*)
        Let $\alpha$ be the answer to the query $\sigma$;
        $H := Cons(H, \sigma, \alpha)$
    endwhile;
    if $|H| = 1$ then output the unique hypothesis $h$ in $H$
    else output “Target concept is not in $C_{m,n}$”
end.

Fig. 1. Algorithm Learner1, where the size $m$ of a target concept is known.
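The loop of Fig. 1 can be sketched in Python for the special case of membership queries. This is an illustration under assumptions, not the algorithm of the paper: concepts are given explicitly as finite sets of strings, only membership queries are available, and step (*) is replaced by a greedy choice of the query whose worst-case answer leaves the fewest candidates. The names `worst_case` and `learner1` are hypothetical.

```python
from itertools import product
from typing import Callable, FrozenSet, List, Optional, Tuple

Concept = FrozenSet[str]


def worst_case(H: List[Concept], v: str) -> int:
    """Number of candidates that survive the less informative answer to Mem(v)."""
    yes = sum(1 for h in H if v in h)
    return max(yes, len(H) - yes)


def learner1(concepts: List[Concept], domain: List[str],
             member: Callable[[str], bool]) -> Tuple[Optional[Concept], int]:
    """Return the surviving hypothesis (None if no candidate remains) and the query count."""
    H = list(concepts)
    asked = 0
    while len(H) >= 2:
        # Greedy stand-in for step (*): pick the best membership query for H.
        v = min(domain, key=lambda u: worst_case(H, u))
        if worst_case(H, v) == len(H):
            raise ValueError("no membership query separates the remaining candidates")
        answer = member(v)
        H = [h for h in H if (v in h) == answer]  # H := Cons(H, sigma, alpha)
        asked += 1
    return (H[0] if H else None), asked


domain = ["".join(bits) for bits in product("01", repeat=2)]
concepts = [frozenset({v}) for v in domain] + [frozenset()]
target = frozenset({"10"})
hypothesis, asked = learner1(concepts, domain, lambda v: v in target)
print(sorted(hypothesis), asked)
```

On this class of singletons each membership query can exclude only one candidate in the worst case, which is why membership queries alone are not always good in the sense defined in this section.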



Definition 3. A query $\sigma$ is $\delta$-good for a concept class $T$ if for any answer $\alpha$,

$|Cons(T,\sigma,\alpha)| \le (1 - \delta)|T|$.
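For membership queries, where the only possible answers are “Yes” and “No”, the set $Cons$ and the $\delta$-goodness test can be written down directly. The sketch below assumes concepts are explicit finite sets of bit strings; queries that return counterexamples would need a larger answer set and are not covered here.

```python
from fractions import Fraction
from typing import FrozenSet, List

Concept = FrozenSet[str]
ANSWERS = (True, False)  # "Yes" / "No" for a membership query


def cons(T: List[Concept], v: str, answer: bool) -> List[Concept]:
    """Concepts in T that give `answer` to the membership query v."""
    return [h for h in T if (v in h) == answer]


def is_good(T: List[Concept], v: str, delta: Fraction) -> bool:
    """True iff every possible answer leaves at most (1 - delta)|T| concepts."""
    return all(len(cons(T, v, a)) <= (1 - delta) * len(T) for a in ANSWERS)


T = [frozenset({v}) for v in ("00", "01", "10", "11")]
print(is_good(T, "00", Fraction(1, 4)))  # True
print(is_good(T, "00", Fraction(1, 2)))  # False
```

For the four singletons, the answer “No” to the query 00 leaves three of the four concepts, so the query is 1/4-good but not 1/2-good.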

Theorem 1. Assume the size $m$ of the target concept is known to a learner. A
concept class $C$ is polynomial-query learnable using $Q$ if and only if there exist
polynomials $q(\cdot,\cdot)$ and $p(\cdot,\cdot)$ such that for any positive integers $m$, $n$ and any
$T \subseteq C_{m,n}$ with $|T| \ge 2$, there exists a query $\sigma$ in $Q$ with $\|\sigma\| \le p(m,n)$ that
is $1/q(m,n)$-good for $T$.

Proof. (if part) Let $p(\cdot,\cdot)$ and $q(\cdot,\cdot)$ be polynomials such that for any positive
integers $m$, $n$ and any $T \subseteq C_{m,n}$ with $|T| \ge 2$, there exists a query $\sigma$ that
is $1/q(m,n)$-good for $T$. We show a learning algorithm using queries in $Q$ in
Figure 1. It is not hard to verify that all procedures in the algorithm, such as
Cons, are computable, since we only deal with boolean concepts.

First we show the correctness of the algorithm. Since $H$ is initialized as $C_{m,n}$,
and the target concept $c^*$ is assumed to be in $C_{m,n}$, $H$ contains $c^*$ before the
first stage. Since $c^*$ is consistent with any answer returned by the oracles in $Q$,
and at any stage $H$ is updated so that only inconsistent concepts are eliminated
from $H$, $c^*$ is never eliminated. Moreover, whenever $|H| \ge 2$, we can find a
query in $Q$ that is $1/q(m,n)$-good for $H$. Thus the output of the algorithm is
guaranteed to be exactly equal to the target concept $c^*$.

We now show that the total number of queries is $O(m \cdot q(m,n))$. We denote
the set $H$ at the $i$-th stage of the algorithm by $H_i$, and let $l$ be the number of stages.
We can show that for any stage $i = 1, 2, \ldots, l-1$,

$|H_{i+1}| \le \left(1 - \frac{1}{q(m,n)}\right) |H_i|$

regardless of the answer from an oracle in $Q$. Since $H_0$ is initialized as $C_{m,n}$, we
have

$|H_i| \le \left(1 - \frac{1}{q(m,n)}\right)^{i} |C_{m,n}|$

for any $i$. We can show that the right-hand side becomes at most one if
$i > q(m,n) \cdot \ln |C_{m,n}|$ by simple calculations, which ensures the termination of the
algorithm. Recall that $|C_{m,n}| \le (|\Delta| + 1)^m$ for any $m$ and $n$, since any concept in $C_{m,n}$
is represented by a string over $\Delta$ of length at most $m$. Since at each stage
exactly one query is asked to an oracle, the total number of queries is
$O(m \cdot q(m,n))$. Since the length of each query is at most $p(m,n)$ and the length of each
counterexample returned by oracles is $n$, the query complexity of the algorithm
is $O(m \cdot q(m,n)(p(m,n) + n))$, which is a polynomial with respect to $m$ and $n$.
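The “simple calculations” can be spelled out as follows. This is a standard estimate, sketched here with $1/q(m,n)$ denoting the goodness guaranteed by the hypothesis of the theorem; it uses only the inequality $1 - x \le e^{-x}$ and the fact that there are at most $(|\Delta|+1)^m$ strings over $\Delta$ of length at most $m$.

```latex
\[
  |H_i| \;\le\; \Bigl(1 - \tfrac{1}{q(m,n)}\Bigr)^{i}\,|C_{m,n}|
        \;\le\; e^{-i/q(m,n)}\,|C_{m,n}|
        \;\le\; 1
  \qquad\text{whenever}\qquad i \;\ge\; q(m,n)\,\ln|C_{m,n}| ,
\]
\[
  \ln|C_{m,n}| \;\le\; \ln\bigl((|\Delta|+1)^{m}\bigr) \;=\; m\,\ln(|\Delta|+1) \;=\; O(m).
\]
```

Hence the loop stops after $O(m \cdot q(m,n))$ stages, and each stage contributes at most $p(m,n) + n$ to the query complexity.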



(only if part) Assume that for any polynomials $p(\cdot,\cdot)$ and $q(\cdot,\cdot)$, there exist
positive integers $m$, $n$ and a set $T \subseteq C_{m,n}$ with $|T| \ge 2$ such that there exists no
query $\sigma$ that is $1/q(m,n)$-good for $T$. Suppose to the contrary that there exists
a learning algorithm $A$ that exactly identifies any target concept using queries
in $Q$, whose query complexity is bounded by a polynomial $p'(m,n)$ for any $m$
and $n$. Let $p(m,n) = p'(m,n)$ and $q(m,n) = 2p'(m,n)$.

We construct an adversary teacher who answers each query $\sigma$ in $Q$ as
follows: If $\|\sigma\| > p(m,n)$, the teacher may answer arbitrarily, say “Yes”. (Since
the query complexity of $A$ is bounded by $p'(m,n) = p(m,n)$, $A$ can actually
never ask such a query.) If $\|\sigma\| \le p(m,n)$, the teacher answers $\alpha$ such that
$|Cons(T,\sigma,\alpha)| > \left(1 - \frac{1}{q(m,n)}\right)|T|$. By the assumption, there always exists such
a malicious answer. The important point is that for any query $\sigma$, its answer $\alpha$
returned by the teacher contradicts less than a $1/q(m,n)$ fraction of the concepts in $T$.
That is,

$|T| - |Cons(T,\sigma,\alpha)| < \frac{1}{q(m,n)} |T|$.

Since the query complexity of $A$ is $p'(m,n)$, at most $p'(m,n)$ queries can be
asked to the teacher.

Thus the learner can eliminate less than $(p'(m,n)/q(m,n))|T|$ concepts after
$p'(m,n)$ queries. Since $q(m,n) = 2p'(m,n)$, more than $(1/2)|T|$ concepts in $T$
are consistent with all the answers. Moreover, since $|T| \ge 2$, at least two distinct
concepts from $T$ are consistent with all the answers so far. Since $A$ is deterministic,
the output of $A$ will be incorrect for at least one concept in $T$, which is a
contradiction. □



4 Specifying Queries 

This section deals with the case where the size $m$ of a target concept is unknown
to a learner. The standard trick to overcome this problem is to guess $m$
incrementally and try to learn: initially let $m = 1$, and if there is no concept in $C_{m,n}$
that is consistent with the answers given by oracles, we double $m$ and repeat.
Algorithm Learner2 ($n$ : positive integer)

Given $Q$ : available queries
begin
    $m := 1$;
    repeat
        simulate Learner1($m, n$) using $Q$;
        if Learner1 outputs a hypothesis $h$ then
            Let $Q'$ be a set of specifying queries for $h$ in $C_n$;
            if $h$ is consistent with the answers for all queries in $Q'$ then
                output $h$ and terminate
        $m := m * 2$
    forever
end.

Fig. 2. Algorithm Learner2, where the target size $m$ is unknown.
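The doubling strategy of Fig. 2 can be illustrated with a toy instance. Everything concrete below is an assumption made for the example: the class $C_m$ consists of the prefixes $\{0, \ldots, k-1\}$ with $k \le m$ (so the “size” of a prefix is its length), the stand-in `learner1` identifies the best candidate in $C_m$ by binary search with membership queries, and a single equivalence query plays the role of the specifying queries.

```python
from typing import Callable, FrozenSet, Tuple

Concept = FrozenSet[int]


def learner1(m: int, member: Callable[[int], bool]) -> Concept:
    """Toy stand-in for Learner1: the class C_m holds the prefixes {0..k-1}, k <= m."""
    lo, hi = 0, m
    while lo < hi:  # binary search using membership queries
        mid = (lo + hi + 1) // 2
        if member(mid - 1):
            lo = mid
        else:
            hi = mid - 1
    return frozenset(range(lo))


def learner2(member: Callable[[int], bool],
             equivalent: Callable[[Concept], bool],
             max_doublings: int = 40) -> Tuple[Concept, int]:
    """Guess the size bound m = 1, 2, 4, ... until the hypothesis is confirmed."""
    m = 1
    for _ in range(max_doublings):
        h = learner1(m, member)
        if equivalent(h):  # {Equ(h)} serves as the set of specifying queries
            return h, m
        m *= 2
    raise RuntimeError("size bound exhausted")


target = frozenset(range(10))
h, m = learner2(lambda x: x in target, lambda g: g == target)
print(sorted(h), m)
```

While the guessed bound is too small, the stand-in returns a hypothesis that is consistent with the queries it asked but wrong; only the confirmation step prevents the learner from stopping too early, which is the role specifying queries play in the algorithm.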



In some cases, such as when the equivalence query is available, or when both subset and
superset queries are available, we can apply the trick correctly, since the learner
can confirm whether the hypothesis is correct or not by asking these queries. The next
definition is an abstraction of this notion.

Definition 4. A set $Q'$ of queries is called specifying queries for a concept $c$
in $T$ if $c$ is the only concept in $T$ that is consistent with the answers that $c$
gives to all queries in $Q'$. That is,

$\{h \in T \mid h[\sigma] = c[\sigma] \text{ for all } \sigma \in Q'\} = \{c\}$.

For instance, if the equivalence oracle is available, the set {Equ(c)} is a trivial
set of specifying queries for any $c$ in $C$. Moreover, if both the superset and subset oracles
are available, the set {wSup(c), wSub(c)} is also a set of specifying queries for any $c$ in $C$.
If only the membership oracle is available, our notion corresponds to the notion
of specifying set [6].
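For membership queries the condition of Definition 4 is easy to test mechanically. The following sketch assumes concepts are explicit finite sets of strings and that $T$ is a list of distinct concepts containing $c$; the function name `is_specifying` is hypothetical.

```python
from typing import FrozenSet, Iterable, List

Concept = FrozenSet[str]


def is_specifying(queries: Iterable[str], c: Concept, T: List[Concept]) -> bool:
    """True iff c is the only concept in T answering every membership query as c does."""
    qs = list(queries)
    agree = [h for h in T if all((v in h) == (v in c) for v in qs)]
    return agree == [c]


T = [frozenset(), frozenset({"00"}), frozenset({"01"}), frozenset({"00", "01"})]
c = frozenset({"00"})
print(is_specifying(["00", "01"], c, T))  # True
print(is_specifying(["00"], c, T))        # False
```

Asking only about 00 cannot separate $\{00\}$ from $\{00, 01\}$, so a second query is needed before the hypothesis can be confirmed.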

Theorem 2. A concept class $C$ is polynomial-query learnable using $Q$ if and
only if there exist polynomials $q(\cdot,\cdot)$, $p(\cdot,\cdot)$ and $r(\cdot,\cdot)$ such that for any positive
integers $m$ and $n$, the following two conditions hold:

(1) for any $T \subseteq C_{m,n}$ with $|T| \ge 2$, there exists a query $\sigma$ in $Q$ with $\|\sigma\| \le
p(m,n)$ that is $1/q(m,n)$-good for $T$.

(2) for any concept $c \in C_{m,n}$, there exist specifying queries $Q' \subseteq Q$ for $c$ in $C_n$
such that $\|Q'\| \le r(m,n)$.

Proof. (if part) We show a learning algorithm Learner2 in Fig. 2, assuming
that the two conditions hold. Condition (2) guarantees that the output of
Learner2 is exactly equal to the target concept, while condition (1) assures
that Learner1 will return a correct hypothesis as soon as $m$ becomes greater
than or equal to the size of the target concept.
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(only if part) Assume that the concept class C is polynomial-query learnable 
by a learning algorithm A using queries in Q. We have only to show the condi- 
tion (2), since Theorem 1 implies the condition (1). Let n > 0 and c G be 
arbitrarily fixed. Let Q be the set of queries asked by A when the target concept 
is c. We can verify that Q is specifying queries, since the output of A is always 
equal to the target concept c. Since the size of Q is bounded by a polynomial, 
the condition holds. □ 
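
The doubling trick used in the if part can be sketched as follows. This is only an illustration under stated assumptions: `learner1` is a hypothetical stand-in for Learner1 run with size bound m, `verify` is a hypothetical routine that asks the specifying queries for a hypothesis and reports whether all answers agree with it, and the cap `max_size` is added here so that the sketch always terminates (the algorithm in Fig. 2 loops forever).

```python
from typing import Callable, Optional, TypeVar

H = TypeVar("H")

def learner2(learner1: Callable[[int], Optional[H]],
             verify: Callable[[H], bool],
             max_size: int = 1 << 20) -> Optional[H]:
    """Doubling search over the unknown target size m."""
    m = 1
    while m <= max_size:              # cap added for the sketch only
        h = learner1(m)               # run the size-bounded learner
        if h is not None and verify(h):
            return h                  # "output h and terminate"
        m *= 2                        # "m = m * 2"
    return None

# Toy usage: the "target" is the number 37, and the bounded learner
# succeeds only once the size bound m has reached 37.
result = learner2(lambda m: 37 if m >= 37 else m, lambda h: h == 37)
print(result)  # 37, found when m = 64
```

Because m doubles at each round, the last size bound tried is less than twice the true size, which is what keeps the total number of queries polynomial.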

Let us notice that the above theorem uniformly gives complete characteriza- 
tions of boolean concept classes that are polynomial-query learnable for each of 
all possible combinations of the queries such as membership, equivalence, super- 
set, subset, disjointness and exhaustiveness queries, and their weak versions. 

Moreover, as a special case, we get the characterization of learning using
equivalence queries alone in terms of the approximate fingerprint property. We
say that a concept class $C$ has an approximate fingerprint property if for any
polynomials $p(\cdot,\cdot)$ and $q(\cdot,\cdot)$, there exist positive integers $m$, $n$ and a set
$T \subseteq C_{m,n}$ with $|T| \ge 2$ such that for any concept $h \in C_{p(m,n),n}$, we have
$|\{c \in T \mid h(w) = c(w)\}| < |T| / q(m,n)$ for some $w \in \Sigma^n$. Since equivalence
queries contain a trivial single specifying query for each concept, we get the
following result.

Corollary 1 ([2,4]). A concept class C is polynomial-query learnable using
equivalence queries if and only if C does not have an approximate fingerprint
property.



5 Conclusion 

We have shown uniform characterizations of polynomial-query learnability
for every combination of queries, such as membership, equivalence, and
superset queries. Our results reveal that polynomial-query learnability
using a set of oracles is equivalent to the existence of a good query to the oracles
that eliminates a certain fraction of any hypothesis space. This is quite intuitive.

In this paper, we dealt only with boolean concepts. We will generalize our
results to general concepts in future work. Moreover, it is also interesting to
investigate the computational complexity of the learning task for honest concept
classes with polynomial query complexity, in a way similar to that of Köbler
and Lindner [8], who showed that $\Sigma_2^p$ oracles are sufficient for learning
using each of three possible combinations of membership and equivalence queries.

Abstract. In their previous paper, Mukouchi and Arikawa discussed
both refutability and inferability of a hypothesis space from examples. If
a target language is a member of the hypothesis space, then an inference
machine should identify it in the limit, otherwise it should refute the
hypothesis space itself in a finite time. They pointed out the necessity of
refutability of a hypothesis space from the viewpoint of machine discovery.
Recently, Mukouchi focused on sequences of examples successively generated
by a certain kind of system. He calls such a sequence an observation with
time passage, and a sequence extended as long as possible a complete
observation. Then the set of all possible complete observations is called
a phenomenon of the system.

In this paper, we introduce phenomena generated by rewriting systems 
known as OL systems and pure grammars, and investigate their inferabil- 
ity in the limit from positive examples as well as refutable inferability 
from complete examples. 

First, we show that any phenomenon class generated by OL systems is
inferable in the limit from positive examples. We also show that the
phenomenon class generated by pure grammars such that the left-hand side
of each production is not longer than a fixed length is inferable in the
limit from positive examples, while the phenomenon class of unrestricted
pure grammars is shown not to be inferable. We also obtain the result
that the phenomenon class of pure grammars such that the number of
productions and that of axioms are not greater than a fixed number is
inferable in the limit from positive examples as well as refutably inferable
from complete examples.



1 Introduction 

Inductive inference is a process of hypothesizing a general rule from examples. 
As a correct inference criterion for inductive inference of formal languages and 
models of logic programming, we have mainly used Gold’s identification in the 
limit [4]. An inference machine M is said to identify a language L in the limit, if 
the sequence of guesses from M, which is successively fed a sequence of examples 
of L, converges to a correct expression of L. 

* Supported in part by Grant-in-Aid for Scientific Research on Priority Areas
No. 10143104 from the Ministry of Education, Science and Culture, Japan.

In this criterion, a target language, whose examples are fed to an inference
machine, is assumed to belong to a hypothesis space which is given in advance.
However, this assumption is not appropriate if we want an inference machine to
infer or to discover an unknown rule which explains examples or data obtained
from scientific experiments. That is, the behavior of an inference machine is not
specified when we feed it examples of a target language that is not in the
hypothesis space in question. Noting this point, many types of inference machines
have been proposed (cf. Blum and Blum [2], Sakurai [20], Mukouchi and Arikawa [13,15],
Lange and Watson [10], Mukouchi [14], Kobayashi and Yokomori [9], and so on).

In their previous paper, Mukouchi and Arikawa [13,15] discussed both 
refutability and inferability of a hypothesis space from examples of a target 
language. If a target language is a member of the hypothesis space, then an 
inference machine should identify it in the limit, otherwise it should refute the 
hypothesis space itself in a finite time. They showed that there are some rich 
hypothesis spaces that are refutable and inferable from complete examples (i.e. 
positive and negative examples), but refutable and inferable classes from only 
positive examples (i.e. text) are very small. 

Recently, Mukouchi [16] focused on sequences of examples successively generated
by a certain kind of system. He calls such a sequence an observation with time
passage, and a sequence extended as long as possible a complete observation. Then
the set of all possible complete observations is called a phenomenon of the system.
For example, we consider positions of a free fall with a distinct initial value. Then
the sequence $(0, -1, -4, -9, -16, \ldots)$ of positions is a complete observation, and
the sequence $(-1, -2, -5, -10, -17, \ldots)$ is another complete observation generated
by a physical system. Then we consider the set of such sequences a phenomenon
generated by the system. We can regard a language as a phenomenon
each of whose complete observations consists of a single word of the language.
Furthermore, we can also regard a total function $f : N \to N$ as a phenomenon
whose complete observation consists of a unique sequence $(f(0), f(1), f(2), \ldots)$
over $N$, where $N$ is the set of all natural numbers. Thus, by considering
learnability of phenomena, we can deal with learnability of both languages and total
functions uniformly. On learnability of phenomena, Mukouchi [16] has presented
some characterization theorems.
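
The two free-fall sequences above can be reproduced with a few lines of code. This is a minimal sketch, assuming that the position at time t is the initial value minus t squared, which is the rule the two quoted sequences follow; the function name is hypothetical.

```python
from typing import List

def free_fall_observation(initial: int, steps: int) -> List[int]:
    """First `steps` positions of a free fall started at position `initial`."""
    return [initial - t * t for t in range(steps)]

print(free_fall_observation(0, 5))   # [0, -1, -4, -9, -16]
print(free_fall_observation(-1, 5))  # [-1, -2, -5, -10, -17]
```

Each initial value yields one complete observation, and the set of all such sequences is the phenomenon.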

In the present paper, we discuss inferability of rewriting systems known
as L systems and pure grammars. L systems are rewriting systems introduced
by Lindenmayer [6] in 1968 to model the growth of filamentous organisms in
developmental biology. There are two types of L systems: one is an IL system, in
which we assume there are interactions, and the other is a OL system, in which
there are no interactions. In this paper, we consider only OL systems. A OL system
and a pure grammar are both rewriting systems that have no distinction between
terminals and nonterminals. In a OL system, at every step in a derivation every
symbol is rewritten in parallel, while in a pure grammar just one substring is
rewritten at each step. Many properties concerning languages generated by L systems
and pure grammars are extensively studied in Lindenmayer [6,7], Gabrielian [3],
Maurer, Salomaa and Wood [8] and so on. Inductive inferability in the limit of
languages generated by L systems and pure grammars from positive examples
is studied in Yokomori [22], Tanida and Yokomori [21] and so on.

In developmental systems with cell lineages, the observed sequences of
branching (tree) structures are represented in terms of sequences of strings
(cf. Jürgensen and Lindenmayer [5]). In this paper, we define phenomena by
all possible sequences of strings whose initial segments, i.e., observations are 
derivations generated by OL systems or pure grammars, and investigate their 
inferability in the limit from positive examples as well as refutable inferability 
from complete examples. First, we show that any phenomenon class generated 
by OL systems is inferable in the limit from positive examples. We also show 
that the phenomenon class generated by pure grammars such that left hand side 
of each production is not longer than a fixed length is inferable in the limit from 
positive examples, while the phenomenon class of unrestricted pure grammars is 
shown not to be inferable. We also obtain the result that the phenomenon class of 
pure grammars such that the number of productions and that of axioms are not 
greater than a fixed number is inferable in the limit from positive examples as 
well as refutably inferable from complete examples. These results may be useful 
to construct a machine learning/discovery system in the field of developmental 
biology. 

2 Rewriting Systems 

Let $\Sigma$ be a fixed finite alphabet. Each element of $\Sigma$ is called a constant symbol.
Let $\Sigma^+$ be the set of all nonnull constant strings over $\Sigma$ and let
$\Sigma^* = \Sigma^+ \cup \{\lambda\}$, where $\lambda$ is the null string.

Let $N = \{0, 1, 2, \ldots\}$ be the set of all natural numbers.

For a string $w \in \Sigma^*$, the length of $w$ is denoted by $|w|$.

Let $\mu = (w_0, w_1, \ldots)$ be a (possibly finite) sequence over $\Sigma^*$ and let $n \in N$ be
a nonnegative integer. The length of $\mu$ is denoted by $|\mu|$. In case $\mu$ is an infinite
sequence, we regard $|\mu| = \infty$. By $\mu[n]$, we denote the $(n+1)$-st element of $\mu$,
and by $\mu^{(n)}$ we also denote the initial segment $(w_0, w_1, \ldots, w_{n-1})$ of length $n$, if
$n \le |\mu|$, otherwise it represents $\mu$ itself.

Let $S$ be a set of sequences over $\Sigma^*$. We put $S^{(n)} = \{\mu^{(n)} \mid \mu \in S\}$ and
$O(S) = \{\mu \mid \mu \text{ is a finite initial segment of some sequence in } S\}$. A pair $(T, F)$
of sets of finite sequences over $\Sigma^*$ is said to be consistent with $S$, if $T \subseteq O(S)$
and $F \cap O(S) = \emptyset$ hold.

2.1 L Systems and Pure Grammars 

A OL system was introduced by Lindenmayer [6] in 1968 and a pure grammar 
was introduced by Gabrielian [3] in 1981. We recall their languages, and then 
introduce phenomena generated by them. 

Definition 1. A production is an expression of the form $\alpha \to \beta$, where $\alpha \in \Sigma^+$
is a nonnull constant string and $\beta \in \Sigma^*$ is a constant string.

A OL system is a triple $G = (\Sigma, P, w)$, where $\Sigma$ is a finite alphabet, $w \in \Sigma^+$
is a nonnull constant string which we call an axiom and $P$ is a finite set of
productions which satisfies the following two conditions:

(1) For any production $(\alpha \to \beta) \in P$, $|\alpha| = 1$, i.e., $\alpha \in \Sigma$ holds.

(2) For any $a \in \Sigma$, there exists at least one production $(\alpha \to \beta) \in P$ such
that $\alpha = a$ holds.

A OL system $G = (\Sigma, P, w)$ is deterministic, or a DOL system for short, if for
any $a \in \Sigma$, there exists just one production $(\alpha \to \beta) \in P$ such that $\alpha = a$ holds.

A OL system $G = (\Sigma, P, w)$ is propagating, or a POL system for short, if for
any $(\alpha \to \beta) \in P$, $|\beta| \ge 1$ holds.

A OL system is a PDOL system, if it is a POL system as well as a DOL system.

We denote by $\mathcal{OL}$, $\mathcal{DOL}$, $\mathcal{POL}$ or $\mathcal{PDOL}$ the class of all OL systems, that of all
DOL systems, that of all POL systems or that of all PDOL systems, respectively.

Definition 2. A pure grammar is a triple $G = (\Sigma, P, S)$, where $\Sigma$ is a finite
alphabet, $P$ is a finite set of productions and $S \subseteq \Sigma^+$ is a finite set of constant
strings each of which we call an axiom.

Let $n \ge 1$ be a positive integer. A pure grammar $G = (\Sigma, P, S)$ is a Pure$_{\le n}$
grammar, if for any production $(\alpha \to \beta) \in P$, $1 \le |\alpha| \le n$ holds.

A pure grammar $G = (\Sigma, P, S)$ is context-free, or a PCF grammar for short,
if $G$ is a Pure$_{\le 1}$ grammar.

We denote by $\mathcal{P}ure$, $\mathcal{P}ure_{\le n}$ or $\mathcal{PCF}$ the class of all pure grammars, that of
all Pure$_{\le n}$ grammars or that of all PCF grammars, respectively.

Definition 3. Let $G = (\Sigma, P, w)$ be a OL system. We define a binary relation
$\Rightarrow_G$ over $\Sigma^*$ as follows: For any $\alpha \in \Sigma^+$ and any $\beta \in \Sigma^*$, $\alpha \Rightarrow_G \beta$ if and only if
there exist productions $(a_i \to \beta_i) \in P$ $(1 \le i \le n)$ such that $\alpha = a_1 a_2 \cdots a_n$ and
$\beta = \beta_1 \beta_2 \cdots \beta_n$ hold, where $n = |\alpha|$.

Let $G = (\Sigma, P, S)$ be a pure grammar. We define a binary relation $\Rightarrow_G$ over
$\Sigma^*$ as follows: For any $\alpha \in \Sigma^+$ and any $\beta \in \Sigma^*$, $\alpha \Rightarrow_G \beta$ if and only if there
exist a production $(\alpha' \to \beta') \in P$ and constant strings $u, v \in \Sigma^*$ such that
$\alpha = u\alpha'v$ and $\beta = u\beta'v$ hold.

Let $G$ be a OL system or a pure grammar.

When no ambiguity occurs, by omitting the explicit reference to $G$, we simply
write $\alpha \Rightarrow \beta$ to denote $\alpha \Rightarrow_G \beta$. Furthermore, by $\Rightarrow_G^*$ we represent the reflexive
and transitive closure of $\Rightarrow_G$.
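
The difference between the two one-step relations of Definition 3 can be made concrete in code. The following is a minimal sketch, assuming that productions are given as pairs of Python strings; the names `step_ol` and `step_pure` are hypothetical helpers introduced only for illustration.

```python
from itertools import product
from typing import List, Set, Tuple

Production = Tuple[str, str]  # (left-hand side, right-hand side)

def step_ol(alpha: str, prods: List[Production]) -> Set[str]:
    """All beta with alpha => beta in a OL system: every symbol is rewritten in parallel."""
    if not alpha:
        return set()
    # For each symbol of alpha, collect the right-hand sides it may be rewritten to.
    choices = [[rhs for lhs, rhs in prods if lhs == sym] for sym in alpha]
    return {"".join(parts) for parts in product(*choices)}

def step_pure(alpha: str, prods: List[Production]) -> Set[str]:
    """All beta with alpha => beta in a pure grammar: one occurrence of one left-hand side is rewritten."""
    result = set()
    for lhs, rhs in prods:
        start = alpha.find(lhs)
        while start != -1:
            result.add(alpha[:start] + rhs + alpha[start + len(lhs):])
            start = alpha.find(lhs, start + 1)  # also finds overlapping occurrences
    return result

print(sorted(step_ol("aa", [("a", "a"), ("a", "aa")])))     # ['aa', 'aaa', 'aaaa']
print(sorted(step_pure("ab", [("a", "ab"), ("b", "ba")])))  # ['aba', 'abb']
```

In the parallel case a word with a symbol that has no production has no successor at all, which is why condition (2) of Definition 1 requires a production for every symbol.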

Definition 4. We define the language $L(G)$ generated by a OL system
$G = (\Sigma, P, w)$ as follows:

$L(G) = \{u \mid w \Rightarrow_G^* u\}.$

We define the language $L(G)$ generated by a pure grammar $G = (\Sigma, P, S)$ as
follows:

$L(G) = \{u \mid \exists w \in S \text{ s.t. } w \Rightarrow_G^* u\}.$

Definition 5. Let $G = (\Sigma, P, w)$ be a OL system. A complete observation generated
by $G$ is a (possibly finite) sequence $\mu = (w_0, w_1, \ldots)$ over $\Sigma^*$ which satisfies
the following three conditions:

(1) $w_0 = w$.

(2) For any $i$ with $0 \le i < |\mu|$, $w_i \Rightarrow_G w_{i+1}$ holds.

(3) If $\mu$ is a finite sequence which ends with $w_n$, then there is no $u \in \Sigma^*$ such
that $w_n \Rightarrow_G u$ holds.

Let $G = (\Sigma, P, S)$ be a pure grammar. A complete observation generated
by $G$ is a (possibly finite) sequence $\mu = (w_0, w_1, \ldots)$ over $\Sigma^*$ which satisfies the
following three conditions:

(1) $w_0 \in S$.

(2) For any $i$ with $0 \le i < |\mu|$, $w_i \Rightarrow_G w_{i+1}$ holds.

(3) If $\mu$ is a finite sequence which ends with $w_n$, then there is no $u \in \Sigma^*$ such
that $w_n \Rightarrow_G u$ holds.

Let $G$ be a OL system or a pure grammar. A phenomenon $\mathcal{P}(G)$ generated
by $G$ is the set of all complete observations generated by $G$.

We call any finite initial segment of a complete observation generated by $G$
an observation.

Now we present examples of languages and phenomena generated by a OL 
system and a pure grammar. 

Example 1. Let $G = (\{a\}, \{a \to a, a \to a^2\}, a)$ be a OL system. Then the
language $L(G)$ and the phenomenon $\mathcal{P}(G)$ generated by $G$ are as follows:

$L(G) = \{a^n \mid n \ge 1\},$

$\mathcal{P}(G) = \{(a, a, a, \ldots), (a, a, a^2, \ldots), (a, a^2, a^2, \ldots), (a, a^2, a^3, \ldots), (a, a^2, a^4, \ldots), \ldots\}$
$= \{(a^{n_0}, a^{n_1}, a^{n_2}, \ldots) \mid n_0 = 1, n_{i-1} \le n_i \le 2n_{i-1} \ (i \ge 1)\}.$

Let $G = (\{a, b\}, \{a \to ab, b \to ba\}, \{ab\})$ be a PCF grammar. Then the
language $L(G)$ and the phenomenon $\mathcal{P}(G)$ generated by $G$ are as follows:

$L(G) = \{ab, abb, aba, abaa, abab, abba, abbb, \ldots\},$

$\mathcal{P}(G) = \{(ab, abb, abbb, \ldots), (ab, abb, abab, \ldots), (ab, abb, abba, \ldots),$
$(ab, aba, abba, \ldots), (ab, aba, abaa, \ldots), (ab, aba, abab, \ldots), \ldots\}.$
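
The first phenomenon in Example 1 can be explored by enumerating initial segments of complete observations. The sketch below is an illustration under assumptions: `observations` is a hypothetical helper that takes a one-step successor function, and `step_example_ol` hard-codes the OL system with productions a -> a and a -> aa, for which a word of length n derives exactly the words of length n to 2n.

```python
from typing import Callable, Iterable, List, Set, Tuple

def observations(axioms: Iterable[str],
                 step: Callable[[str], Set[str]],
                 k: int) -> List[Tuple[str, ...]]:
    """Initial segments of length k of all complete observations (shorter if a derivation halts)."""
    segments = [(w,) for w in axioms]
    for _ in range(k - 1):
        extended = []
        for seg in segments:
            successors = step(seg[-1])
            if successors:
                extended.extend(seg + (u,) for u in sorted(successors))
            else:
                extended.append(seg)  # no successor: the observation is already complete
        segments = extended
    return segments

def step_example_ol(word: str) -> Set[str]:
    """One parallel step of G = ({a}, {a -> a, a -> aa}, a): each a becomes a or aa."""
    n = len(word)
    return {"a" * m for m in range(n, 2 * n + 1)} if n else set()

for seg in observations(["a"], step_example_ol, 3):
    print(seg)
```

For k = 3 this prints the five prefixes (a, a, a), (a, a, aa), (a, aa, aa), (a, aa, aaa) and (a, aa, aaaa), in agreement with the listing in the example.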



2.2 Some Properties 

Definition 6. For two OL systems $G = (\Sigma, P, w)$ and $G' = (\Sigma, P', w')$, we write
$G \le G'$, if $P \subseteq P'$ and $w = w'$ hold.

For two pure grammars $G = (\Sigma, P, S)$ and $G' = (\Sigma, P', S')$, we write $G \le G'$,
if $P \subseteq P'$ and $S \subseteq S'$ hold, and we also write $G < G'$, if $G \le G'$ and $G \ne G'$
hold.

Let $T$ be a set of sequences over $\Sigma^*$ and let $G$ be a OL system or a pure
grammar. Then we say $G$ is reduced with respect to $T$, if (i) $T \subseteq O(\mathcal{P}(G))$ holds
and (ii) for any $G' < G$, $T \not\subseteq O(\mathcal{P}(G'))$ holds. Moreover, we simply say $G$ is
reduced, if $G$ is reduced with respect to $O(\mathcal{P}(G))$.

Proposition 7. Let $n \ge 1$ be a positive integer and let $G = (\Sigma, P, S)$ be a
reduced Pure$_{\le n}$ grammar.

Then every production in $P$ is used in the derivations in $\mathcal{P}(G)^{(k_1)}$, where
$k_1 = 1 + k_0 + k_0^2 + \cdots + k_0^n$ with $k_0 = |\Sigma|$.

Proposition 8. Let $G = (\Sigma, P, w)$ be a reduced OL system.

Then every production in $P$ is used in the derivations in $\mathcal{P}(G)^{(k_1)}$, where
$k_1 = 1 + |\Sigma|$.

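Definition 9. For constant strings $u, v \in \Sigma^*$, we define sets of productions as
follows ($n \ge 1$):

$\mathrm{Prod}_*(u, v) = \{\alpha \to \beta \mid \alpha \in \Sigma^+, \beta \in \Sigma^*, \exists s, t \in \Sigma^* \text{ s.t. } u = s\alpha t, v = s\beta t\},$

$\mathrm{Prod}_{\le n}(u, v) = \{\alpha \to \beta \mid \alpha \in \Sigma^+, \beta \in \Sigma^*, |\alpha| \le n, \exists s, t \in \Sigma^* \text{ s.t. } u = s\alpha t, v = s\beta t\},$

$\mathrm{Prod}_{OL}(u, v) = \{a_1 \to \beta_1, a_2 \to \beta_2, \ldots, a_n \to \beta_n \mid n = |u|, a_i \in \Sigma, \beta_i \in \Sigma^*, u = a_1 a_2 \cdots a_n, v = \beta_1 \beta_2 \cdots \beta_n\}.$

For a set $T$ of sequences over $\Sigma^*$, we put

$\mathrm{Prod}_*(T) = \bigcup_{\mu \in T} \bigcup_{0 \le i < |\mu|} \mathrm{Prod}_*(\mu[i], \mu[i+1]).$

We also define $\mathrm{Prod}_{\le n}(T)$ and $\mathrm{Prod}_{OL}(T)$, similarly.

It is easy to see that for any finite set $T$ of finite sequences over $\Sigma^*$, $\mathrm{Prod}_*(T)$,
$\mathrm{Prod}_{\le n}(T)$ ($n \ge 1$) and $\mathrm{Prod}_{OL}(T)$ are finite sets.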

A finite-set-valued function $F(u, v)$ is said to be recursively generable, if there
is an effective procedure that on inputs $u$ and $v$ enumerates all elements in $F(u, v)$
and then stops (cf. Lange and Zeugmann [11]).

Proposition 10. For constant strings $u, v \in \Sigma^*$, the sets
$\mathrm{Prod}_*(u, v)$, $\mathrm{Prod}_{\le n}(u, v)$ ($n \ge 1$) and $\mathrm{Prod}_{OL}(u, v)$ are recursively generable.
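
Proposition 10 says that these production sets can be enumerated by a terminating procedure. One possible procedure is sketched below, assuming that strings over the alphabet are Python strings and that a production is a pair of strings; the function names `prod_pure` and `prod_ol` are hypothetical, and the enumeration in `prod_ol` is exponential in the worst case, which is acceptable for an illustration only.

```python
from typing import List, Optional, Set, Tuple

Production = Tuple[str, str]  # (left-hand side, right-hand side)

def prod_pure(u: str, v: str, n: Optional[int] = None) -> Set[Production]:
    """Prod_*(u, v), or Prod_{<=n}(u, v) when a bound n is given."""
    result: Set[Production] = set()
    for i in range(len(u) + 1):                 # candidate prefix s = u[:i]
        if not v.startswith(u[:i]):
            break                               # longer prefixes cannot match either
        for j in range(i + 1, len(u) + 1):      # alpha = u[i:j], suffix t = u[j:]
            if n is not None and j - i > n:
                break
            t = u[j:]
            if len(v) >= i + len(t) and v.endswith(t):
                result.add((u[i:j], v[i:len(v) - len(t)]))
    return result

def prod_ol(u: str, v: str) -> Set[Production]:
    """Prod_OL(u, v): productions used in any split v = beta_1 ... beta_n with n = |u|."""
    result: Set[Production] = set()

    def split(k: int, pos: int, chosen: List[Production]) -> None:
        if k == len(u):
            if pos == len(v):                   # the whole of v has been consumed
                result.update(chosen)
            return
        for end in range(pos, len(v) + 1):      # beta_{k+1} = v[pos:end], possibly null
            split(k + 1, end, chosen + [(u[k], v[pos:end])])

    split(0, 0, [])
    return result

print(sorted(prod_pure("ab", "abb")))     # [('a', 'ab'), ('ab', 'abb'), ('b', 'bb')]
print(sorted(prod_pure("ab", "abb", 1)))  # [('a', 'ab'), ('b', 'bb')]
print(len(prod_ol("ab", "abb")))          # 8
```

Both functions stop after finitely many steps because the number of ways to cut u and v is finite, which is the content of the finiteness remark after Definition 9.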



3 Inferability of Rewriting Systems 



Let $\mathcal{G}$ be a class of OL systems or pure grammars. We denote by $L(\mathcal{G})$ (resp.,
$\mathcal{P}(\mathcal{G})$) the class of languages (resp., phenomena) generated by OL systems or pure
grammars in $\mathcal{G}$. For example, $L(\mathcal{OL})$ or $\mathcal{P}(\mathcal{OL})$ represents the class of languages
or that of phenomena generated by OL systems, respectively.

Due to the space limitation, we omit the detailed definitions of learnability.
For definitions and properties of language learning, please refer to Angluin [1],
Gold [4], Osherson, Stob and Weinstein [17], Lange and Zeugmann [12], and
so on. For those of refutable learning and phenomena learning, please refer to
Mukouchi and Arikawa [13,15], and Mukouchi [16].

3.1 Language Classes Generated by Rewriting Systems

We summarize results concerning inferability of language classes generated
by OL systems or pure grammars.

Theorem 11 (Yokomori [22]). (1) $L(\mathcal{OL})$ and $L(\mathcal{POL})$ are not inferable in the
limit from positive examples.

(2) $L(\mathcal{PDOL})$ is inferable in the limit from positive examples.

Theorem 12 (Tanida and Yokomori [21]). $L(\mathcal{PCF})$ is not inferable in the limit
from positive examples.

We know from the latter theorem that $L(\mathcal{P}ure_{\le n})$ and $L(\mathcal{P}ure)$ are not
inferable in the limit from positive examples.

On refutable inferability from complete examples, the following results are
valid:

Theorem 13. (1) $L(\mathcal{OL})$ and $L(\mathcal{POL})$ are not refutably inferable from complete
examples.

(2) $L(\mathcal{PDOL})$ is refutably inferable from complete examples.

Theorem 14. $L(\mathcal{PCF})$, $L(\mathcal{P}ure_{\le n})$ and $L(\mathcal{P}ure)$ are not refutably inferable from
complete examples.

Proof. We shall show only the case of $L(\mathcal{PCF})$. The other cases are shown
similarly.

By Corollary 6 in Mukouchi and Arikawa [15], it is sufficient for us to show
that $L(\mathcal{PCF})$ contains every nonempty finite language over $\Sigma$.

Let $L \subseteq \Sigma^+$ be a nonempty finite language. Put $G = (\Sigma, P, L)$, where
$P = \{a \to a \mid a \in \Sigma\}$. Then $G$ is a PCF grammar, and $L(G) = L$ holds. Therefore
we have $L \in L(\mathcal{PCF})$.

Thus $L(\mathcal{PCF})$ contains every nonempty finite language. □

3.2 Phenomenon Classes Generated by OL Systems 

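Theorem 15. $\mathcal{P}(\mathcal{OL})$, $\mathcal{P}(\mathcal{POL})$, $\mathcal{P}(\mathcal{DOL})$ and $\mathcal{P}(\mathcal{PDOL})$ are inferable in the limit
from positive examples.

Proof. It is sufficient for us to show that $\mathcal{P}(\mathcal{OL})$ is inferable in the limit from
positive examples, because each of the other classes is a subclass of $\mathcal{P}(\mathcal{OL})$.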

First, we consider the following algorithm: 



Algorithm SubIIM($\mathcal{G}$, $\delta$)

Input : a class $\mathcal{G}$ of finitely many OL systems,
        a finite sequence $\delta$ of observations;
Output : a OL system $G$;

begin
    let $m = |\mathcal{G}|$ and $\mathcal{G} = \{G_1, G_2, \ldots, G_m\}$;
    let $n$ be the length of $\delta$ and $\delta = \mu_1, \mu_2, \ldots, \mu_n$;
    $T := \{\mu_1, \mu_2, \ldots, \mu_n\}$;
    search for the least index $i$ $(1 \le i \le m)$ such that
        • $T \subseteq O(\mathcal{P}(G_i))$, and
        • $\forall j$ $(1 \le j \le m)$, $[T \subseteq O(\mathcal{P}(G_j)) \Rightarrow O(\mathcal{P}(G_j)^{(n)}) \not\subsetneq O(\mathcal{P}(G_i)^{(n)})]$;
    if such an index $i$ is found then return $G_i$ else return $G_1$;
end.
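
The first condition in SubIIM, that every observed sequence is an observation of the candidate system, can be tested mechanically: a nonempty finite sequence is an observation of a OL system exactly when it starts with the axiom and each consecutive pair is a derivation step. The sketch below illustrates this test under assumptions: productions are pairs of Python strings, and the names `derives_ol`, `is_observation` and `consistent` are hypothetical. It does not implement the minimality condition of the algorithm.

```python
from typing import Iterable, Sequence, Set, Tuple

Production = Tuple[str, str]  # (left-hand side symbol, right-hand side)

def derives_ol(alpha: str, beta: str, prods: Iterable[Production]) -> bool:
    """Decide alpha => beta for a OL system by tracking reachable positions in beta."""
    prods = list(prods)
    if not alpha:
        return False
    positions: Set[int] = {0}
    for sym in alpha:
        # Extend every reachable position by each right-hand side of sym that matches there.
        positions = {p + len(rhs)
                     for p in positions
                     for lhs, rhs in prods
                     if lhs == sym and beta.startswith(rhs, p)}
        if not positions:
            return False
    return len(beta) in positions

def is_observation(seq: Sequence[str], axiom: str, prods: Iterable[Production]) -> bool:
    """True iff the nonempty sequence seq is a finite initial segment of a complete observation."""
    prods = list(prods)
    return (len(seq) > 0 and seq[0] == axiom
            and all(derives_ol(x, y, prods) for x, y in zip(seq, seq[1:])))

def consistent(T: Iterable[Sequence[str]], axiom: str, prods: Iterable[Production]) -> bool:
    """True iff every sequence in T is an observation of the system."""
    prods = list(prods)
    return all(is_observation(mu, axiom, prods) for mu in T)

P = [("a", "a"), ("a", "aa")]
print(consistent([("a", "aa", "aaa"), ("a", "a")], "a", P))  # True
print(consistent([("a", "aaa")], "a", P))                    # False
```

The position-tracking loop avoids enumerating all parallel rewritings of a word, so each step test runs in time polynomial in the lengths of the two words and the number of productions.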



Similarly to the proof of Theorem 19 in Mukouchi [14], we can show the 
following claim: 

Claim: Let $\mathcal{G}$ be a class of finitely many OL systems, let $\mathcal{P} \in \mathcal{P}(\mathcal{G})$ be a
phenomenon and let $\sigma$ be a positive presentation of $\mathcal{P}$.

Then there exist an $n \in N$ and a $G \in \mathcal{G}$ such that for any $m \ge n$,
SubIIM($\mathcal{G}$, $\sigma[m]$) $= G$ and $\mathcal{P} = \mathcal{P}(G)$ hold, where $\sigma[m]$ represents the finite
initial segment of $\sigma$ of length $m$.

Now we consider an IIM which infers $\mathcal{P}(\mathcal{OL})$ in the limit from positive
examples:



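Algorithm IIM

Input : positive examples;
Output : a OL system $G$;

begin
    let $\delta_0$ be the null sequence;
    $T_0 := \emptyset$; $k_1 := 1 + |\Sigma|$; $i := 1$;
    repeat
        read the next example $\mu_i$;
        $T_i := T_{i-1} \cup \{\mu_i\}$; $\delta_i := \delta_{i-1}, \mu_i$;
        $w_0 := \mu_i[0]$; $P_i := \mathrm{Prod}_{OL}(T_i^{(k_1)})$;      (1)
        $\mathcal{G}_i := \{(\Sigma, P, w_0) \mid P \subseteq P_i\}$;      (2)
        $G_i :=$ SubIIM($\mathcal{G}_i$, $\delta_i$);
        output $G_i$;
        $i := i + 1$;
    forever;
end.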



In this algorithm, we note that for any $i \ge 1$, $P_i$ at (1) above is a finite set
of productions, and thus $\mathcal{G}_i$ at (2) is a class of finitely many OL systems.

Let $\mathcal{P} \in \mathcal{P}(\mathcal{OL})$ be a phenomenon, let $\sigma$ be a positive presentation of $\mathcal{P}$ and
let $k_1 = 1 + |\Sigma|$. Since $\mathcal{P}^{(k_1)}$ is a finite set of finite sequences, when we feed
examples of $\sigma$ successively to the algorithm above on its input requests, there
exists an $i_0 \in N$ such that for any $i \ge i_0$, $T_i^{(k_1)} = T_{i_0}^{(k_1)}$, and thus $P_i = P_{i_0}$ and
$\mathcal{G}_i = \mathcal{G}_{i_0}$ hold. We note that $w_0$ at (1) does not change, because a OL system has
a unique axiom.

Let $G = (\Sigma, P, w)$ be a reduced OL system w.r.t. $\mathcal{P}$. Then, by Proposition 8,
we know that $P \subseteq P_{i_0}$ holds. Furthermore, clearly $w = w_0$ holds. Therefore,
$G \in \mathcal{G}_{i_0}$ holds, and thus we have $\mathcal{P} \in \mathcal{P}(\mathcal{G}_{i_0})$. Thus, by the claim above, we
conclude that the algorithm above converges to $G'$ with $\mathcal{P} = \mathcal{P}(G')$ for $\sigma$. □

Lemma 16 (Mukouchi [16]). If a class $C$ of phenomena is refutably inferable
from complete examples, then for any phenomenon $\mathcal{P}_0 \notin C$, there exists a pair
$(T, F)$ of finite sets of finite sequences such that (i) $(T, F)$ is consistent with $\mathcal{P}_0$
and (ii) $(T, F)$ is inconsistent with every $\mathcal{P} \in C$.

Theorem 17. $\mathcal{P}(\mathcal{OL})$ and $\mathcal{P}(\mathcal{POL})$ are not refutably inferable from complete
examples.

Proof. We shall show only the case of $\mathcal{P}(\mathcal{OL})$. The case of $\mathcal{P}(\mathcal{POL})$ is shown
similarly.

Let $\mathcal{P}_0 = \{(a, b, b, b, \ldots), (a, bb, bb, bb, \ldots), (a, b^3, b^3, b^3, \ldots), \ldots\} =
\{(a, b^i, b^i, b^i, \ldots) \mid i \ge 1\}$ be a phenomenon. It is easy to see that $\mathcal{P}_0 \notin \mathcal{P}(\mathcal{OL})$
holds.

Let $(T, F)$ be a pair of finite sets of finite sequences such that $(T, F)$ is
consistent with $\mathcal{P}_0$. Put $G = (\Sigma, P, a)$, where $P = \{a \to b^i \mid (a, b^i, b^i, \ldots, b^i) \in T\}
\cup \{b \to b\}$. Then $G$ is a OL system, and $T \subseteq O(\mathcal{P}(G))$ holds. Furthermore,
we see that $O(\mathcal{P}(G)) \subseteq O(\mathcal{P}_0)$ holds, and thus $(T, F)$ is consistent with $\mathcal{P}(G)$.

Therefore, by Lemma 16, it follows that $\mathcal{P}(\mathcal{OL})$ is not refutably inferable from
complete examples. □

Theorem 18. $\mathcal{P}(\mathcal{DOL})$ and $\mathcal{P}(\mathcal{PDOL})$ are refutably inferable from complete
examples.

3.3 Phenomenon Classes Generated by Pure Grammars

Theorem 19. $\mathcal{P}(\mathcal{P}ure)$ is not inferable in the limit from positive examples.

Proof. By Corollary 6 in Mukouchi [16], it is sufficient for us to show that there
exists an infinite sequence of phenomena $\mathcal{P}_0, \mathcal{P}_1, \mathcal{P}_2, \ldots \in \mathcal{P}(\mathcal{P}ure)$ such that

$O(\mathcal{P}_1) \subsetneq O(\mathcal{P}_2) \subsetneq \cdots \subsetneq O(\mathcal{P}_0)$ and $O(\mathcal{P}_0) = \bigcup_{i \ge 1} O(\mathcal{P}_i).$

Let $\Sigma = \{a, b, c\}$ be an alphabet. We define finite sets of productions as
follows:

$P_0 = \{a \to bcb, c \to c^2\}$, $P_i = \{a \to bcb, bcb \to bc^2b, \ldots, bc^{i-1}b \to bc^ib\}$ $(i \ge 1)$.

Let $G_i = (\Sigma, P_i, \{a\})$ be a pure grammar, and put $\mathcal{P}_i = \mathcal{P}(G_i)$ $(i \in N)$.
Then

$\mathcal{P}_0 = \{(a, bcb, bc^2b, \ldots)\}$, $\mathcal{P}_i = \{(a, bcb, bc^2b, \ldots, bc^ib)\}$ $(i \ge 1)$.

It is easy to see that these phenomena satisfy the condition above, and thus
$\mathcal{P}(\mathcal{P}ure)$ is not inferable in the limit from positive examples. □

Similarly to the proof of Theorem 15, we can show the following theorem:

Theorem 20. $\mathcal{P}(\mathcal{PCF})$ and $\mathcal{P}(\mathcal{P}ure_{\le n})$ are inferable in the limit from positive
examples.

Theorem 21. $\mathcal{P}(\mathcal{PCF})$, $\mathcal{P}(\mathcal{P}ure_{\le n})$ and $\mathcal{P}(\mathcal{P}ure)$ are not refutably inferable
from complete examples.

Proof. We shall show only the case of $\mathcal{P}(\mathcal{PCF})$. The other cases are shown
similarly.

Let $\mathcal{P}_0 = \{(a, b), (a, bb), (a, b^3), \ldots\} = \{(a, b^i) \mid i \ge 1\}$ be a phenomenon.
It is easy to see that $\mathcal{P}_0 \notin \mathcal{P}(\mathcal{PCF})$ holds.

Let $(T, F)$ be a pair of finite sets of finite sequences such that $(T, F)$ is
consistent with $\mathcal{P}_0$. Put $G = (\Sigma, P, \{a\})$, where $P = \{a \to b^i \mid (a, b^i) \in T\}$.
Then $G$ is a PCF grammar, and $T \subseteq O(\mathcal{P}(G))$ holds. Furthermore, we see that
$O(\mathcal{P}(G)) \subseteq O(\mathcal{P}_0)$ holds, and thus $(T, F)$ is consistent with $\mathcal{P}(G)$.

Therefore, by Lemma 16, it follows that $\mathcal{P}(\mathcal{PCF})$ is not refutably inferable
from complete examples. □

On the other hand, as shown below, when we restrict the number of produc- 
tions and that of axioms by a fixed number, the phenomenon class is shown to 
be inferable in the limit from positive examples as well as refutably inferable 
from complete examples. 

Definition 22. Let $n \ge 1$ be a positive integer. A pure grammar $G = (\Sigma, P, S)$
is a Pure$^{\le n}$ grammar, if $|P| \le n$ and $|S| \le n$ hold.

We denote by $\mathcal{P}ure^{\le n}$ the class of all Pure$^{\le n}$ grammars.

Theorem 23. Let $n \ge 1$ be a positive integer.

(1) $\mathcal{P}(\mathcal{P}ure^{\le n})$ is inferable in the limit from positive examples.

(2) $\mathcal{P}(\mathcal{P}ure^{\le n})$ is refutably inferable from complete examples.

4 Concluding Remarks 

In the present paper, we have introduced phenomena generated by OL systems 
and pure grammars, and discussed inferability in the limit from positive examples 
as well as refutable inferability from complete examples. We can summarize as 
in Table 1 the results obtained so far, where for a class $C$, $C \in$ LIM-TXT means
that $C$ is inferable in the limit from positive examples and $C \in$ REF-INF means
that $C$ is refutably inferable from complete examples.

We see by this table that the learning power is increased in general when 
we receive as examples observations of a phenomenon rather than words of a 
language. 

As related work, Sakakibara [18] and Sakamoto [19] studied language learning
of a class of context-free languages using structural information. We note
that phenomenon learning of classes generated by rewriting systems uses
information on derivations but never assumes their structural information.

Table 1. Inferability of classes generated by rewriting systems

Rewriting systems     Language Class                            Note   Phenomenon Class

OL systems            $L(\mathcal{OL}) \notin$ LIM-TXT          *1     $\mathcal{P}(\mathcal{OL}) \in$ LIM-TXT
                      $L(\mathcal{OL}) \notin$ REF-INF                 $\mathcal{P}(\mathcal{OL}) \notin$ REF-INF

POL systems           $L(\mathcal{POL}) \notin$ LIM-TXT         *1     $\mathcal{P}(\mathcal{POL}) \in$ LIM-TXT
                      $L(\mathcal{POL}) \notin$ REF-INF                $\mathcal{P}(\mathcal{POL}) \notin$ REF-INF

DOL systems                                                            $\mathcal{P}(\mathcal{DOL}) \in$ LIM-TXT
                                                                       $\mathcal{P}(\mathcal{DOL}) \in$ REF-INF

PDOL systems          $L(\mathcal{PDOL}) \in$ LIM-TXT           *1     $\mathcal{P}(\mathcal{PDOL}) \in$ LIM-TXT
                      $L(\mathcal{PDOL}) \in$ REF-INF                  $\mathcal{P}(\mathcal{PDOL}) \in$ REF-INF

Pure grammars         $L(\mathcal{P}ure) \notin$ LIM-TXT        *2     $\mathcal{P}(\mathcal{P}ure) \notin$ LIM-TXT
                      $L(\mathcal{P}ure) \notin$ REF-INF               $\mathcal{P}(\mathcal{P}ure) \notin$ REF-INF

Pure$_{\le n}$ grammars   $L(\mathcal{P}ure_{\le n}) \notin$ LIM-TXT       $\mathcal{P}(\mathcal{P}ure_{\le n}) \in$ LIM-TXT
                      $L(\mathcal{P}ure_{\le n}) \notin$ REF-INF           $\mathcal{P}(\mathcal{P}ure_{\le n}) \notin$ REF-INF

PCF grammars          $L(\mathcal{PCF}) \notin$ LIM-TXT         *2     $\mathcal{P}(\mathcal{PCF}) \in$ LIM-TXT
                      $L(\mathcal{PCF}) \notin$ REF-INF                $\mathcal{P}(\mathcal{PCF}) \notin$ REF-INF

Pure$^{\le n}$ grammars                                                $\mathcal{P}(\mathcal{P}ure^{\le n}) \in$ LIM-TXT
                                                                       $\mathcal{P}(\mathcal{P}ure^{\le n}) \in$ REF-INF

Note: The results marked as *1 are obtained in Yokomori [22] and those marked
as *2 are obtained in Tanida and Yokomori [21].

Abstract. Software tools for genomic research, such as homology search,
are very useful and have contributed to the progress of genomic
research. However, these tools are not designed directly for scientific
discovery, and more discovery-oriented software tools are strongly
expected to assist scientific discovery in genomic research. We have
designed and developed a multistrategic and discovery-oriented system,
Genomic Hypothesis Creator, by introducing two notions: view on data
and view space on data. With these newly defined notions, we describe
View Designer, a component of Genomic Hypothesis Creator, which
dynamically creates new views on data and searches a view space for more
appropriate views. A good view obtained from Genomic Hypothesis Creator
makes it possible for us to understand the data and eventually attain
the goal of discovery. Genomic Hypothesis Creator can be extended
by adding the user's own views on data and hypothesis generators into the
system with plug-in interfaces. Therefore it would be feasible to apply
this system to problems other than genomic research.



1 Introduction 

Currently, genome sequencing projects have been organized for about 70
organisms, and sequence and structural databases have accumulated into gigabytes
in size. Some projects have already finished sequencing, and the complete genomes
are open to the public: Escherichia coli, 4,639,221bp (1997); Haemophilus
influenzae, 1,830,135bp (1995); Bacillus subtilis, 4,214,814bp (1997); Saccharomyces
cerevisiae, 12,069,313bp (1996). In 1998, the sequencing project of Caenorhabditis
elegans, a multicellular organism, will be finished with the complete genome
of size 100Mbp. Recently, it has been announced that the human genome
sequencing will be finished by 2001 [16]. The human haploid genome consists
of 3,000Mbp of DNA and encodes 75,000~100,000 genes. By the end of 1997,
more than 50,000 genes had been identified, but only about 5,000 of them have
known functions. Facing the coming huge production of such genomic
sequences, the issue of assisting scientific discovery by using data compiled in
databases is a matter of utmost concern in Genome Science.

Machine learning and data mining technology have been used for knowledge
discovery and prediction in many fields [10], and successful results for scientific
data have been surveyed in [6]. In recent years, especially, applications of
machine learning techniques to a variety of “real-world” problems have been
reported [9]. For genomic data, which is one kind of “real-world” data, it is also very
promising that knowledge discovery using machine learning techniques will
play an important role in the process of scientific discovery.

Many institutes (NCBI, EMBL/EBI, MIPS, DDBJ, TIGR, GenomeNet, etc.)
provide services with the databases GenBank, EMBL (typical databases of DNA
sequences), SWISS-PROT, PIR (typical databases of amino acid sequences of
proteins), and many other databases [20] whose data are text files in specific
record formats. Homology search and conventional information retrieval with
keywords are the most common services. Although these services have made
considerable contributions in assisting scientific discoveries in genomic research,
more discovery-oriented services would change the style of research and speed up
the process of scientific discovery. There has been some research on discovery
of specific topics, e.g., [4,15], and these contributions should be highly
appreciated. However, it is a challenge to design and develop a general system which
can strongly assist the process of scientific discovery with the above databases
for Genome Science.



[Figure: system components, including View Designer (View1, ..., View4) and
Hypothesis Generator]

Fig. 1. The design concept of Genomic Hypothesis Creator.



In the history of science, it has been repeatedly witnessed that an invention of
a new view on data is a key to scientific discovery. Aiming at a discovery-oriented
service for genome science with the motivation above, we have developed a system,
Genomic Hypothesis Creator, which is designed to discover a new view on
data in text databases. The process of knowledge discovery from databases starts
with data collection and ends with knowledge, as is overviewed in [7]. The
components of Genomic Hypothesis Creator are shown in Fig. 1, which basically
follows the KDD process in [7]. The contributions of this paper are the notions of view
over sequences and view space over sequences, with which we have designed View
Designer. Section 2 gives formal definitions of a view and a view space and their
relation to visualization of hypotheses, which is reflected in the design of Visualizer.
View Designer is intended as a tool for users to discover, manually or automatically,
a new view on data which yields a better understanding of data. Distinctive
features can also be found in Data Collector and Hypothesis Generator, described in
Sections 3 and 4, respectively. For Data Collector, we employed and modified the
text database management system SIGMA [1], which has been used for service
at the Computer Center of Kyushu University, since Genomic Hypothesis
Creator assumes text data. As is indicated in [17], the main bottleneck for scientific
knowledge discovery applications is not the lack of techniques for data analysis.
The problem is to exploit and combine existing algorithms effectively. From this
point of view, Genomic Hypothesis Creator employs the multistrategy principle [12] and
the plug-in architecture used in Kepler [18]. By linking a view created with View
Designer to a hypothesis generator selected from the pool of hypothesis
generators, Genomic Hypothesis Creator provides a diversity of knowledge discovery
tools.

2 View Designer for Discovery from Genomic Sequences 

2.1 View and Discovery 

Informally, a view on data provides terms with which we understand and explain 
the data. A discovery or definition of a new view on data is a key to scientific 
discovery. Discovering a view on data has usually been done by experts in the
field, since it is the most important part of the discovery process. However, when
we have to deal with data for which we cannot assume any experts, this process
turns out to be the most difficult obstacle in discovery.

In the design of Genomic Hypothesis Creator, we focus our attention on 
this matter. We define the notion of a view in a rather abstract way. With this 
definition, we can separate the process of discovery into the process of view 
design/discovery and the process of hypothesis generation. 

Definition 1. Let $\Sigma$ be a finite alphabet called a data alphabet. We call a
string $s$ in $\Sigma^*$ a $\Sigma$-sequence. Let $\Gamma$ be a finite alphabet called a view
representation alphabet. A view over $\Sigma$-sequences is a pair $M = (V, L)$ of an
algorithm $V$ with two input parameters and a set $L \subseteq \Gamma^*$ satisfying the
following conditions:

1. $V$ takes two strings $s \in \Sigma^*$ and $\pi \in \Gamma^*$ as input. If $\pi \in L$, then $V$ on
$(s, \pi)$ outputs a value $V(s, \pi)$ in a set $W$. Otherwise, $V$ on $(s, \pi)$ outputs
“undefined”.

2. $V$ on $(s, \pi)$ runs in polynomial time with respect to $|s|$ and $|\pi|$.

We call the algorithm $V$ the view interpreter, an element $\pi \in L$ a view element,
and the set $W$ the view element value set. For a set $S \subseteq \Sigma^*$, we call the $S \times L$
matrix $S^M$ defined by $S^M(s, \pi) = V(s, \pi)$ for $s \in S$ and $\pi \in L$ the data matrix
of $S$ under the view $M$.
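
To make Definition 1 concrete, the following is a minimal Python sketch of a view as a pair of an interpreter and a set of view elements, together with the data matrix it induces. It is an illustration only, not the implementation used in Genomic Hypothesis Creator; the names `View`, `data_matrix`, and `substring_view` are hypothetical, and the set of view elements is assumed to be finite so that the matrix can be enumerated.

```python
UNDEFINED = "undefined"  # output of V when the view element is not in L


class View:
    """A view M = (V, L): a view interpreter V and a set L of view elements."""

    def __init__(self, interpreter, elements):
        self.interpreter = interpreter   # V: (s, pi) -> value in W
        self.elements = list(elements)   # L, assumed finite in this sketch

    def value(self, s, pi):
        # V(s, pi) if pi belongs to L, otherwise "undefined"
        if pi not in self.elements:
            return UNDEFINED
        return self.interpreter(s, pi)


def data_matrix(S, view):
    """The S x L matrix with entry (s, pi) equal to V(s, pi)."""
    return {(s, pi): view.value(s, pi) for s in S for pi in view.elements}


# Hypothetical example view: "does the string pi occur in s?" (Boolean values)
substring_view = View(lambda s, pi: pi in s, ["ac", "gt"])
matrix = data_matrix(["acgt", "ttga"], substring_view)
```

Here the view element value set is the Booleans; any other value set (numbers, strings) fits the same shape.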

In the above definition, we have not exactly specified what kind of values are 
considered as view element values. They can be boolean values, real numbers, 
strings, etc. The first reason why we defined a view as an algorithmic process 
is that we can deal with various views flexibly. The second reason is that the 
time-series process of $V$ on $(s, \pi)$ gives the interpretation of the sequence $s$ by
the view element $\pi$, which provides a bridge between view design and visualization
of hypotheses.

Given a collection $S_0, \ldots, S_{m-1}$ of sets of sequences over $\Sigma$, the data
matrices $S_0^M, \ldots, S_{m-1}^M$ are used by a hypothesis generator for producing a
hypothesis for $S_0, \ldots, S_{m-1}$ which is represented in terms of the view elements of
$M = (V, L)$. For example, when we want to discover an explanation discriminating
two sets $S_0$ and $S_1$ of positive and negative examples, respectively, the data
matrices $S_0^M$ and $S_1^M$ will be constructed for a hypothesis generator to create a
hypothesis described with view elements in $L$.

View 1 Alphabet Indexing and Regular Patterns. We presented a system BONSAI [14],
which is designed for discovering classifications of amino acid residues
automatically from positive and negative examples. BONSAI has a parameter
which specifies the number $k$ of categories into which the symbols describing
the original sequences are classified. In the case of amino acid sequences of
proteins, we set $\Sigma$ to be the alphabet consisting of the 20 amino acid residues. Let
$\Gamma_k = \{0, 1, \ldots, k-1\}$ when the parameter $k$ is chosen. An alphabet indexing is
a mapping $\psi_k : \Sigma \to \Gamma_k$ which classifies the symbols in $\Sigma$ into $k$ categories.

Given two sets $S_0$ and $S_1$ of strings over $\Sigma$, BONSAI discovers an alphabet
indexing $\psi_k$ and a small decision tree $T$ whose internal nodes are labeled with
patterns of the form $\pi = xwy$ ($w \in \Gamma_k^+$) and whose leaves are labeled with 0 or 1.
We denote by $L(T)$ the set of strings in $\Gamma_k^*$ classified as 0 by $T$. The accuracy
of the hypothesis $(T, \psi_k)$ is defined by [formula lost in the source text], where
$\tilde{\psi}_k(a_1 \cdots a_n) = \psi_k(a_1) \cdots \psi_k(a_n)$ for $a_i \in \Sigma$. Then a view employed by BONSAI
is defined as follows: Let $V_{\psi_k}$ be a polynomial-time algorithm which decides,
for given $s \in \Sigma^*$ and a pattern $\pi = x_1 w x_2$ with $w \in \Gamma_k^+$, whether $w$ occurs in
$\tilde{\psi}_k(s)$, and shows all occurrences of $w$ in $\tilde{\psi}_k(s)$ at the corresponding locations
in the string $s$. Let $L_k = \{x_1 w x_2 \mid w \in \Gamma_k^+\}$. A view $M_{\psi_k} = (V_{\psi_k}, L_k)$ over
$\Sigma$-sequences is employed by BONSAI. BONSAI could discover biologically meaningful
knowledge from 689 transmembrane domain data and 19256 non-transmembrane data [14].
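
The alphabet-indexing view can be illustrated with a short Python sketch. It is not BONSAI's code: the function names and the toy four-letter alphabet are hypothetical, and it assumes $k \le 10$ so that every category is written as a single digit and positions in the indexed string coincide with positions in the original string.

```python
def apply_indexing(psi, s):
    """Extend psi: Sigma -> {0, ..., k-1} to strings, symbol by symbol."""
    return "".join(str(psi[a]) for a in s)


def occurrences(psi, s, w):
    """0-based start positions in s at which w occurs in the indexed string."""
    indexed = apply_indexing(psi, s)
    return [p for p in range(len(indexed) - len(w) + 1)
            if indexed.startswith(w, p)]


# Hypothetical indexing of a toy alphabet into k = 2 categories
toy_psi = {"A": 0, "C": 1, "D": 0, "E": 1}
indexed_example = apply_indexing(toy_psi, "ACDE")
hits = occurrences(toy_psi, "ACDE", "01")
```

A view element $x_1 w x_2$ then evaluates to true on `s` exactly when `occurrences(psi, s, w)` is non-empty, and the returned positions are what a visualizer would highlight on `s`.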

View 2 Approximate String Matching. For an integer $k \ge 0$, let $X_k = \{(w, k) \mid
w \in \Sigma^*\}$. Let $A$ be a polynomial-time algorithm which decides if, for a string
$s \in \Sigma^*$ and $(w, k)$ with $w \in \Sigma^*$ and $k \ge 0$, $s$ contains a substring whose edit
distance from $w$ is at most $k$, and shows all such substrings. Then $\mathrm{APPROX}^k =
(A, X_k)$ is a view over $\Sigma$-sequences. We call $\mathrm{APPROX}^k$ the k-mismatch view.
We can also consider a range on $|w|$, e.g., $p \le |w| \le q$, to define a new view.
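
One standard way to realize an interpreter of this kind is the dynamic-programming scheme for approximate substring matching, in which a match is allowed to start at any position of $s$. The Python sketch below uses that textbook technique; it is an assumption that it resembles the algorithm $A$ of the system, and the function names are hypothetical. It reports end positions of matching substrings rather than the substrings themselves.

```python
def approx_match_ends(s, w, k):
    """1-based end positions e such that some substring of s ending at e
    has edit distance at most k from w."""
    m = len(w)
    # col[i] = least edit distance between w[:i] and a substring of s
    #          ending at the current position
    col = list(range(m + 1))
    ends = []
    for e, c in enumerate(s, start=1):
        prev_diag = col[0]
        col[0] = 0                      # a match may start anywhere in s
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(prev_diag + (w[i - 1] != c),  # match or substitution
                         old + 1,                        # extra symbol in s
                         col[i - 1] + 1)                 # symbol of w missing
            prev_diag = old
        if col[m] <= k:
            ends.append(e)
    return ends


def contains_approx(s, w, k):
    """True if s has a substring within edit distance k of w.
    The empty substring counts, which matters only when len(w) <= k."""
    return len(w) <= k or bool(approx_match_ends(s, w, k))
```

The running time is proportional to $|s| \cdot |w|$, which is consistent with the polynomial-time requirement of Definition 1.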

View 3 PROSITE View. Let $E$ be the set of all prosite patterns in PROSITE [3]
and let $P$ be a prosite pattern matching algorithm. Then $\mathrm{PROSITE} = (P, E)$
provides a view over amino acid sequences. We call this view the prosite view.

View 4 Range Restrictions. Let $1 \le i \le j$ be positive integers. For a string
$s = s[1]s[2] \cdots s[n]$ in $\Sigma^*$ with $s[i] \in \Sigma$ for $1 \le i \le n$, let $s[i, j] = s[i^*] \cdots s[j^*]$ and
$s[-j, -i] = s[n - j^* + 1] \cdots s[n - i^* + 1]$, where $i^* = \min\{i, n\}$ and $j^* = \min\{j, n\}$.
$\mathrm{APPROX}^k[i, j] = (A[i, j], X_k)$ ($\mathrm{APPROX}^k[-j, -i] = (A[-j, -i], X_k)$) is a view
given by the view interpreter $A[i, j]$ ($A[-j, -i]$) which runs $A$ with $s[i, j]$
($s[-j, -i]$) instead of $s$. Similarly, $\mathrm{PROSITE}[i, j]$ and $\mathrm{PROSITE}[-j, -i]$
are given by restricting the range of search to $s[i, j]$ and $s[-j, -i]$, respectively.
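
The two range restrictions translate directly into string slicing. The sketch below follows the definitions above, converting the 1-based inclusive positions of the text into Python's 0-based half-open slices; the function names are hypothetical.

```python
def restrict_prefix(s, i, j):
    """s[i, j]: positions i*..j* counted from the left (1-based, inclusive),
    with i* = min(i, n) and j* = min(j, n)."""
    n = len(s)
    i_star, j_star = min(i, n), min(j, n)
    return s[i_star - 1:j_star] if n > 0 else ""


def restrict_suffix(s, i, j):
    """s[-j, -i]: positions n-j*+1 .. n-i*+1, i.e. the same window
    counted from the right end of s."""
    n = len(s)
    i_star, j_star = min(i, n), min(j, n)
    return s[n - j_star:n - i_star + 1] if n > 0 else ""
```

A range-restricted interpreter such as $A[i, j]$ is then the original interpreter applied to `restrict_prefix(s, i, j)` in place of `s`.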

2.2 Searching View Space 

A view gives us a method of interpreting sequences. In the process of discovery, 
however, one of the most important aspects is a discovery of a new view on 
data since a choice of a better view may lead to a better understanding of data. 
Considering this aspect of discovery, we define a view space and show some
examples which are implemented in Genomic Hypothesis Creator.

We use a simple abstraction of search strategies. For a set $N$ called a search
space, a search strategy $\sigma$ over $N$ is a procedure that specifies the start element $\nu_0$
in $N$ and determines the next element in $N$ to visit for any $\nu \in N$ when
an arbitrary score function $\varphi : N \to R$ is given to $\sigma$, in the sense that $\sigma$ can use
the value $\varphi(\mu)$ for any $\mu \in N$, where $R$ is the set of real numbers.

Definition 2. A view space over $\Sigma$-sequences is a pair $(\mathcal{M}, \sigma)$, where $\mathcal{M} =
\{M_\nu\}_{\nu \in N}$ is a collection of views over $\Sigma$-sequences indexed by $N$ and $\sigma$ is a
search strategy for $N$. When $\sigma$ is clear from the context, we simply denote the
view space by $\mathcal{M}$.

Let $\mathcal{M} = \{M_\nu\}_{\nu \in N}$ be a view space over $\Sigma$-sequences with a search
strategy $\sigma$. For a view $M_\nu$ in $\mathcal{M}$, data sets $S_0, \ldots, S_{m-1}$ of $\Sigma$-sequences are
transformed to data matrices $S_0^{M_\nu}, \ldots, S_{m-1}^{M_\nu}$ for a hypothesis generator $\mathcal{G}$ to generate
a hypothesis $h$ which explains $S_0, \ldots, S_{m-1}$. We assume a score function $\varphi$ that
measures the “goodness” of the hypothesis. On the other hand, we can regard this
score function as a score function given to the view $M_\nu$ for the data $S_0, \ldots, S_{m-1}$.
In this way we make a connection between a view space and a hypothesis
generator. Since the definition of the score function for $\mathcal{G}$ is a matter which is dependent
on $\mathcal{G}$ itself, we shall not discuss this matter further.

The following view spaces are realized in Genomic Hypothesis Creator. 

ViewSpace 1 Alphabet Indexing and Regular Patterns with Local Search Strategy.
BONSAI uses the view space $\{M_{\psi_k}\}_{\psi_k \in (\Gamma_k)^\Sigma}$, where $(\Gamma_k)^\Sigma$ is the set
of mappings $\psi_k : \Sigma \to \Gamma_k$. For a fixed view $M_{\psi_k}$, BONSAI with $S_0$
of positive examples and $S_1$ of negative examples generates a hypothesis by
using $S_0^{M_{\psi_k}}$ and $S_1^{M_{\psi_k}}$. A hypothesis is represented as a small decision tree. By
employing the accuracy of the decision tree as the score function, it starts with
a randomly generated view, changes the view by local search, and produces
a hypothesis which attains a local maximum. In this way, BONSAI searches the
view space for a fixed parameter $k$ which is specified by the user.
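
The local search over alphabet indexings can be sketched as a simple hill climb. This is a generic illustration, not BONSAI's actual procedure: the neighborhood (changing the category of exactly one symbol), the first-improvement rule, and all names are assumptions, and `score` stands in for the accuracy of the decision tree produced by the hypothesis generator under the view given by the indexing.

```python
import random


def neighbors(psi, k):
    """Indexings that differ from psi in the category of exactly one symbol."""
    for a in psi:
        for c in range(k):
            if c != psi[a]:
                nxt = dict(psi)
                nxt[a] = c
                yield nxt


def local_search(alphabet, k, score, rng):
    """Hill-climb from a random indexing until no neighbor scores higher."""
    psi = {a: rng.randrange(k) for a in alphabet}   # randomly generated view
    best = score(psi)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(psi, k):
            val = score(cand)
            if val > best:                 # move to the first better neighbor
                psi, best, improved = cand, val, True
                break
    return psi, best


# Toy score for demonstration only: agreement with a hidden target indexing
target = {"A": 0, "C": 1, "D": 0, "E": 1}
toy_score = lambda psi: sum(psi[a] == target[a] for a in target)
found, found_score = local_search("ACDE", 2, toy_score, random.Random(0))
```

The loop terminates because the score strictly increases with every move and the set of indexings is finite; in general the result is only a local maximum.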

ViewSpace 2 Approximate Matching with Exhaustive Search. The view space
$AX^k[*, *] = \{\mathrm{APPROX}^k[i, j] \mid 1 \le i \le j\}$ has numerical parameters $i$ and $j$.
When we can assume some rational properties of the score function $\varphi$ of the
hypothesis generator $\mathcal{G}$, some efficient algorithms are known for solving this
optimization problem [8]. As long as we cannot assume any specific properties
of the score function, we employ the exhaustive search strategy for finding
an optimal interval $I = [i, j]$ by bounding $j - i$ by some constant. Similarly,
we define view spaces $AX^k[i, *] = \{\mathrm{APPROX}^k[i, j] \mid 1 \le i \le j\}$ and
$AX^k[*, j] = \{\mathrm{APPROX}^k[i, j] \mid 1 \le i \le j\}$. We also consider the view space
$AX^* = \{\mathrm{APPROX}^k \mid k \ge 0\}$ over $\Sigma$-sequences with an exhaustive search
strategy.



2.3 Operations on Views 

In view design, it is convenient to combine several views to define a new view. 
View Designer of Genomic Hypothesis Creator is designed to have an ability to 
combine several views. 

Definition 3. Let $M_i = (V_i, L_i)$ be a view over $\Sigma$-sequences with a view element
value set $W_i$ for $1 \le i \le l$. We assume $L_i \cap L_j = \emptyset$ for $i \ne j$. Let $V^+$ be an
algorithm such that $V^+$ on $(s, \pi)$ simulates $V_i$ on $(s, \pi)$ if $\pi$ belongs to $L_i$. Then
we define $M_1 + \cdots + M_l = (V^+, L^+)$, where $L^+ = L_1 \cup \cdots \cup L_l$. Furthermore, we
define a view $M_1 \times \cdots \times M_l = (V^\times, L_1 \times \cdots \times L_l)$, where $V^\times$ is an algorithm
such that $V^\times$ on $(s, (\pi_1, \ldots, \pi_l))$ runs $V_i$ on $(s, \pi_i)$ for all $1 \le i \le l$ and outputs
$(V_1(s, \pi_1), \ldots, V_l(s, \pi_l))$.
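
The two operations of Definition 3 can be sketched in Python with each view given as a pair of an interpreter function and a finite list of view elements. The representation and the names `view_sum` and `view_product` are hypothetical and serve only to illustrate the definition.

```python
from itertools import product


def view_sum(views):
    """M_1 + ... + M_l: dispatch each element to the view that owns it.
    The element sets must be pairwise disjoint."""
    owner = {}
    for interp, elements in views:
        for pi in elements:
            if pi in owner:
                raise ValueError("element sets must be pairwise disjoint")
            owner[pi] = interp

    def v_plus(s, pi):
        return owner[pi](s, pi) if pi in owner else "undefined"

    return v_plus, list(owner)


def view_product(views):
    """M_1 x ... x M_l: run every interpreter on its own component."""
    def v_times(s, pis):
        if len(pis) != len(views) or any(
                pi not in elements for (_, elements), pi in zip(views, pis)):
            return "undefined"
        return tuple(interp(s, pi) for (interp, _), pi in zip(views, pis))

    return v_times, list(product(*(elements for _, elements in views)))


# Two hypothetical views: substring occurrence and a minimum-length test
occurs = (lambda s, pi: pi in s, ["ac", "gt"])
at_least = (lambda s, pi: len(s) >= pi, [3, 5])
v_plus, l_plus = view_sum([occurs, at_least])
v_times, l_times = view_product([occurs, at_least])
```

The product view has $|L_1| \cdots |L_l|$ elements, so its data matrix grows quickly with the number of combined views.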

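Let $\mathcal{M} = \{M_\nu\}_{\nu \in N}$ and $\mathcal{M}' = \{M'_{\nu'}\}_{\nu' \in N'}$ be two view spaces with search
strategies $\sigma$ and $\sigma'$, respectively. It is also convenient to create a new view space
by combining view spaces $(\mathcal{M}, \sigma)$ and $(\mathcal{M}', \sigma')$. This view space shall consist of
$\mathcal{M} + \mathcal{M}' = \{M_\nu + M'_{\nu'}\}_{(\nu, \nu') \in N \times N'}$ (or $\mathcal{M} \times \mathcal{M}' = \{M_\nu \times M'_{\nu'}\}_{(\nu, \nu') \in N \times N'}$) and
a new search strategy $\sigma''$ for $N \times N'$. However, it is not possible to define $\sigma''$ in
a uniform way.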

Genomic Hypothesis Creator employs two kinds of search strategies: local
search and exhaustive search, shown in ViewSpace 1 and ViewSpace 2, respectively.
Fortunately, we can define a search strategy in a natural way for any
combination of exhaustive search and local search. In this way, Genomic
Hypothesis Creator is designed to allow combinations of views and view spaces.
2.4 Visualization of Views and Hypotheses

Visualization of hypotheses is also an important problem in the discovery process 
for understanding hypotheses. 

We defined a view over sequences as a pair $M = (V, L)$ of the view
interpreter $V$ and the set $L$ of view elements. Then a hypothesis generator produces
a hypothesis which is represented by using view elements in L. For example, if 
a hypothesis is described as a binary decision diagram with view elements as 
decision rules, then, in addition to the visualization of the diagram itself, we 
need to visualize the view elements on the nodes for better understanding of the 
hypothesis. 

In Genomic Hypothesis Creator, the view interpreter is used for visualizing
view elements in hypotheses. Given $(s, \pi)$ in $\Sigma^* \times L$, the view interpreter on
$(s, \pi)$ shows exactly, as an algorithmic process, how the string $s$ is viewed by
the view element $\pi$. The views pre-installed in View Designer are equipped with view
interpreters which animate this algorithmic process.

A view provided by a user can be also used for visualization if the user uses a 
view interpreter provided by View Designer or provides a view interpreter with 
a capability of visualization in a specified format. 

In this way, a strong association of views and hypothesis visualization is 
realized in Genomic Hypothesis Creator. 



2.5 View Designer of Genomic Hypothesis Creator 

Genomic Hypothesis Creator deals with DNA sequences of genes of various or- 
ganisms. We shall now describe views and view spaces which are implemented 
in Genomic Hypothesis Creator. 

Organisms are roughly classified into two classes, prokaryotes and eukaryotes. 
E. coli is a prokaryotic organism which has a single circular chromosome and 
its DNA is described as a single DNA sequence of about 4.6 Mbp. In a
prokaryotic DNA sequence, functionally related genes are coded in a series called
an operon and are transcribed at one time. Intuitively, a region which is
concerned with one transcription process has the structure shown in Fig. 2. A
prokaryotic transcription unit contains one or more coding regions, each of which 
encodes one protein. We assume that locations of coding regions are specified on 
the sequence. 

Eukaryotes have a different structure of genes. A gene consists of exons and 
introns and encodes one protein (Fig. 3). In the process of splicing, introns are 
removed and only exons are concatenated to produce a sequence which encodes 
a protein. A single cell eukaryotic organism S. cerevisiae has about 6,000 genes. 
Most of the genes of S. cerevisiae do not have any introns. However, the genes 
of Homo sapiens have a large variety in the number of exons (introns) and their 
lengths. 

The locations of genes on DNA sequences are either confirmed by experiments or
predicted by software [19]. Databases [20] are constructed based on such
information.
Fig. 2. Structure of prokaryotic genes. Patterns on the sequence are not unique.

Fig. 3. Structure of a eukaryotic gene.



As a template, View Designer assumes specific regions and locations on a DNA
sequence, such as the transcription start site, the start point of the ORF, exons,
introns, the ORF, etc., which are shown in Fig. 2 and Fig. 3. Further, the translation
of the ORF to an amino acid sequence is also attached to the region of the ORF on
the DNA sequence.

The 5' region of the transcription startsite is called the upstream while the 3' 
region is called the downstream. How to set the length of the upstream and the 
length of the terminator region becomes a matter of view design. 

The View Designer of Genomic Hypothesis Creator is developed with the
following principles so that it can cope with a diversity of requirements.

(V1) User can place views $\mathrm{APPROX}^k$, $\mathrm{APPROX}^k[i, j]$, $\mathrm{APPROX}^k[-j, -i]$ on any
regions by specifying parameters $k$, $i$, and $j$.

(V2) User can place views $\mathrm{PROSITE}$, $\mathrm{PROSITE}[i, j]$, $\mathrm{PROSITE}[-j, -i]$ on the
amino acid sequence located on the region of ORF by specifying the
parameters.

(V3) User can place view spaces $AX^k[*, *]$ and [...] for any $k$ specified by the user.
We call these view spaces the atomic view spaces.

(V4) User can use user’s own view by plugging-in the view M = (U, L) in a 
specified format. We assume the set L of view elements is finite. User can 
also employ the view interpreters provided by View Designer to design user’s 
view by defining a finite collection L of view elements. 

(V5) User can combine arbitrary views to create a new view. 

(V6) User can combine a view and a view space to create a new view space. 

(V7) User can combine two view spaces to create a new view space under the
condition that at most two atomic view spaces are included.
3 Data Collection from Text Databases

Genomic Hypothesis Creator employs SIGMA [1] for collecting data from
databases. SIGMA is a general-purpose text database management system that
realizes very precise and fast one-way sequential processing of text files. The
pattern matching algorithm [2] implemented in SIGMA can handle tens of thousands
of keywords simultaneously in one search, efficiently in both time and space. In
addition to conventional keyword search, SIGMA realizes very fine search and
replace operations which cannot be done with usual information retrieval systems
that use inverted files. In SIGMA, we can define records on a text in an almost
arbitrary way by specifying some strings as record delimiters. This allows us to handle
various kinds of text data.

For example, SIGMA can collect records which contain a sequence with seg- 
ments cgatgacc, tagatt, taatgagttgg occurring in this order by a single search. 
It is also possible to collect records such that a given keyword repeats more than 
four times. For example, records with tandem repeats of a specific segment are 
collectable. For a given file I of keywords, it is possible to collect all records which 
contain at least one keyword in I through one text search. Furthermore, these 
searches can be combined into one search over the text. Collected records can be 
refiled into another format in a very flexible manner so that the refiled data can 
be used for another analysis. The text search principle realized in SIGMA makes 
such fine operations feasible. Genomic Hypothesis Creator employs SIGMA for 
searching and refiling. With this system, by using keywords in annotations,
sequence segments, etc., the user can easily collect segments of DNA sequences of target
genes in a user-specified format. The collected data are then transferred to a
hypothesis generator through View Designer.
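
The kinds of record searches described above can be imitated in a few lines of Python. This sketch is not SIGMA and does not reproduce its one-pass multi-keyword pattern matching algorithm [2]; it uses Python's standard `re` module and string methods, and the function names, the record delimiter, and the sample text are hypothetical.

```python
import re


def split_records(text, delimiter):
    """Define records on a text by a delimiter string."""
    return [r for r in text.split(delimiter) if r.strip()]


def has_segments_in_order(record, segments):
    """True if all segments occur in the record in the given order."""
    pattern = ".*?".join(re.escape(seg) for seg in segments)
    return re.search(pattern, record, flags=re.DOTALL) is not None


def repeats_more_than(record, keyword, n):
    """True if keyword has more than n non-overlapping occurrences."""
    return record.count(keyword) > n


def has_any_keyword(record, keywords):
    """True if the record contains at least one keyword from the list."""
    return any(kw in record for kw in keywords)


sample = "ID1\nttcgatgaccaatagattgg\n//\nID2\ntagattcgatgacc\n//\n"
records = split_records(sample, "//\n")
ordered = [has_segments_in_order(r, ["cgatgacc", "tagatt"]) for r in records]
```

Unlike the one-way sequential processing attributed to SIGMA, this version scans a record once per test, so it illustrates the query semantics rather than the performance.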

Various genome databases [20] also provide very useful information retrieval
tools based on inverted files. However, they do not offer the fine-grained search
ability of SIGMA, which is best suited to our purpose.

4 Hypothesis Generator 

The multistrategy principle has been extensively discussed as an important
aspect of knowledge discovery systems [11,12,17,18]. Genomic Hypothesis
Creator is designed to allow the user to select a hypothesis generator from the pool $\mathcal{H}$
of hypothesis generators, and to plug an external tool in a specific
format into $\mathcal{H}$ without redevelopment of the system core. Such flexibility is
also realized in the data mining system Kepler [18], which produces decision trees,
neural networks, etc. In contrast with Kepler, Genomic Hypothesis Creator
automatically repeats the generation of hypotheses according to views determined
by a search strategy of a view space.

We consider a hypothesis generator as an algorithm $\mathcal{G}$ such that, given the
data matrices $S_0^M, \ldots, S_{m-1}^M$ of $S_0, \ldots, S_{m-1}$ under a view $M$, it generates
an expression representing a hypothesis $h$ on $S_0, \ldots, S_{m-1}$ in terms of the view
elements of $M$. A score function $\varphi$ is associated with the hypothesis generator
in order to evaluate hypotheses generated by $\mathcal{G}$.
The following hypothesis generators are implemented in the current
prototype system of Genomic Hypothesis Creator.

Decision Tree The hypothesis generated in the BONSAI system is a small
decision tree, as is mentioned in ViewSpace 1. The hypothesis generator $\mathcal{G}_B$ of
BONSAI executes the following: Given sets $S_0$ and $S_1$ of positive and negative
examples, $\mathcal{G}_B$ randomly chooses small sets $P$ and $N$ of positive and negative
examples from $S_0$ and $S_1$, respectively. Then BONSAI generates a decision tree $T$
which discriminates $P$ and $N$ perfectly by using the ID3 algorithm [13]. The
score function is the accuracy function given in ViewSpace 1. The
size $|P|$ (resp. $|N|$) is called the window size of positive (resp. negative) training
examples. In BONSAI, the window size is set from 5 to 20 in order to keep $T$
small. The size of the decision tree can be controlled by the window sizes. This
hypothesis generator has a special significance in knowledge discovery since the sets $P$
and $N$ used for constructing a decision tree can be regarded as representative
examples of the hypothesis.



Binary Decision Diagram Binary decision diagrams (BDDs) are a useful
representation of Boolean functions, which are regarded as directed acyclic
graphs [5].

We have constructed a hypothesis generator $\mathcal{G}_{BDD}$ for binary decision
diagrams in the following heuristic way: Given data matrices $S_0^M$ and $S_1^M$, $\mathcal{G}_{BDD}$
makes a decision tree $T$ in the same way as BONSAI except that, in this case,
instead of randomly chosen small sets $P$ and $N$, all examples in $S_0$ and $S_1$ are
used for generation of the decision tree. This produces, in general, a large tree
which classifies $S_0$ and $S_1$ as optimally as possible. Then $\mathcal{G}_{BDD}$ transforms the
resulting tree $T$ to an equivalent BDD by employing the method in [5]. This
reduces the number of nodes to some extent if the same view elements occur
in the tree. The score function $\varphi_{BDD}$ is defined by the number of nodes of the
BDD. The resulting BDD itself may provide interesting knowledge if view
elements are interpreted visually. Further, when a view space is placed on data,
if we could discover a view under which the score of the resulting BDD is locally
optimal, the view can provide a good understanding of the data $S_0$ and $S_1$ and
can be regarded as a kind of discovery. It is not hard to extend this hypothesis
generator to a $k$-decision diagram with $k \ge 3$.
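
The effect of turning a decision tree into a diagram can be illustrated with the two usual reduction rules: identical subgraphs are shared, and a test whose two branches lead to the same node is dropped. The Python sketch below applies these generic rules; it is not claimed to be the method of [5]. The tuple encoding of trees and the name `reduce_tree` are hypothetical, and since a decision tree does not impose an order on view elements, the result is a free (unordered) diagram.

```python
def reduce_tree(tree):
    """Reduce a decision tree to a shared diagram.

    A tree is a terminal 0 or 1, or a tuple (element, low, high).
    Returns (root_id, nodes), where nodes maps a node id to
    (element, low_id, high_id); ids 0 and 1 denote the terminals."""
    unique = {}   # (element, low_id, high_id) -> node id
    nodes = {}

    def build(t):
        if t in (0, 1):
            return t
        element, low, high = t
        lo, hi = build(low), build(high)
        if lo == hi:                 # both branches agree: test is redundant
            return lo
        key = (element, lo, hi)
        if key not in unique:        # share structurally identical nodes
            unique[key] = len(unique) + 2
            nodes[unique[key]] = key
        return unique[key]

    return build(tree), nodes


# Hypothetical tree with four internal nodes; "q" is tested in two places
example_tree = ("p", ("q", 0, 1), ("r", ("q", 0, 1), 1))
root, diagram = reduce_tree(example_tree)
node_count = len(diagram)   # plays the role of the score of the diagram
```

In this example the two copies of the test on "q" are merged, so the diagram has three internal nodes instead of four.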

Conclusion 

Our motivation in this paper is to contribute to scientific discovery in genomic
research by formulating a method of scientific discovery and realizing it as
Genomic Hypothesis Creator. We have defined new notions of a view over sequences
and a view space over sequences that constitute the foundations of the discussion in
this paper. With them, View Designer is designed for automatically creating new
views on data, which may give the user a key to discovery. Users can add their own
views to the system through a plug-in interface. Currently the view space
$\{M_{\psi_k}\}$ used in the BONSAI system and the view space $AX^k[*, *]$ in ViewSpace 2 are
available.

In Genomic Hypothesis Creator, users can adopt multiple strategies for discovery
by selecting hypothesis generators from the pool of hypothesis generators. The
system also supports plugging external hypothesis generators into the
pool, which extends the capabilities of Genomic Hypothesis Creator in a specific
way. In the process of searching a view space $(\mathcal{M}, \sigma)$, a view of $\mathcal{M}$ selected by $\sigma$
is linked to a hypothesis generator which is selected from the pool of hypothesis
generators. In this way Genomic Hypothesis Creator offers a very wide range of
methods for scientific discovery from text databases.

Experiments are scheduled, and we will report the experimental
results with further discussion of theoretical aspects.
Abstract. This paper describes a method for supporting knowledge
evolution and facilitating awareness in a community at the same time.
We propose two ideas. One is associative representation for facilitating
externalization of both personal and community information. The
associative representation links heterogeneous information without defining
the semantics strictly. We leave the interpretation of the semantics to
human background knowledge. The other is visualization of information
interaction in a community using the talking-alter-egos metaphor. The
talking-alter-egos metaphor mimics a salon in which alter-egos representing
community members interact with each other, so that community
members can see how their own and others’ knowledge interact.

We have developed a system called CoMeMo-Community that pursues
collaborative story generation based on the talking-alter-egos metaphor. We
investigated how far people can exchange ideas with associative
representation and how people react to the talking-alter-egos metaphor.



1 Introduction 

Though all people are autonomous information processors, they create their
knowledge not by themselves but by interacting with others, such as persons, articles,
and so forth. In particular, it is necessary to interact with other persons through
communication in order to create more organized knowledge. People may form and
join a group when they collaborate with each other to solve common problems
or fulfill common interests. A community is a group of persons with common
interests.

The first step of spontaneous communication in a community is to be
aware of others’ states, for example, who knows what, what the common interests are,
and so forth. This is called awareness [1]. Communication has not only the function
of conversation but also the role of knowledge interaction. The more the awareness of a
community is enhanced, the more active knowledge interaction becomes. But personal
information tends to be heterogeneous and its structures are ambiguous. These
properties prevent community members from communicating and collaborating with
each other. Such heterogeneous information should be exchanged and shared
more freely.

When people consider and summarize their own ideas, they usually prefer to
be alone. But when their thoughts or ideas come to a deadlock, people want to
contact others in order to find a clue. Human knowledge creation
is practiced both by studying alone and by exchanging opinions with each other.
People evolve their knowledge by combining personal knowledge and others’
knowledge in turn.

Our research aims to support the evolution of both personal and community
knowledge. We propose two ideas. One is associative representation for
facilitating externalization of both personal and community information. The other
is visualization of information interaction using the talking-alter-egos metaphor. We
developed a system called CoMeMo-Community based on these two ideas.

In the next section, we give an overview of CoMeMo-Community. In
Section 3, we propose associative representation, which facilitates externalization of
human knowledge, and describe how people generate and understand it. In Section
4, we describe the alter-ego metaphor, which visualizes community knowledge
interaction, and report how people react to it.

2 CoMeMo-Community : A System for Supporting 
Community Knowledge Evolution 

2.1 Overview of CoMeMo-Community 

CoMeMo-Community is a system designed to support community knowledge
evolution by enhancing community awareness. This system is based on the
following two ideas.

Associative representation : to facilitate externalization of both
personal and community knowledge

Talking-alter-ego metaphor : to visualize community members' knowledge
interaction using human alter-egos



Figure 1 gives an overview of CoMeMo-Community. Community members
externalize their own knowledge using associative representation and store it as
personal knowledge. Once the stored personal knowledge is uploaded, alter-egos
representing the community members interact with each other in the conversation
place. This knowledge interaction is visualized by the talking-alter-ego metaphor and
generates new knowledge. The generated knowledge is extracted as community
knowledge and shared in the community. By setting in motion and observing a
virtual conversation among the alter-egos of herself/himself and/or others, the user
becomes more aware of the community.

Fig. 1. Overview of CoMeMo-Community



2.2 Cycle of Community Knowledge Evolution 

This system provides the user with two phases. One is the personal phase for
maintaining personal knowledge with associative representation. The other is the
community phase for observing community knowledge interaction using the
talking-alter-egos metaphor. By going back and forth between the personal and
community phases, the user evolves knowledge. The cycle of community knowledge
evolution is shown in Figure 2.




Fig. 2. Cycle of community knowledge evolution 



When users join a virtual community for the first time, they must register
an alter-ego that retains their knowledge, so that it can be summoned freely. Thus one
may run the virtual conversation as many times as he/she wants at any time
and observe what happens. The users update their own personal knowledge using
associative representation in everyday life, for example, when ideas occur to
them. Because personal knowledge is updated day by day, one may see a different
conversation each time even if the same set of keywords is given. The users take
useful information from the conversation place back to their personal knowledge and
reuse it. Thus the users evolve their knowledge by combining personal knowledge and
others’ knowledge in turn.

Takashi Hirata et al.



3 Maintaining Personal Knowledge with Associative 
Representation 

CoMeMo-Community supports the maintenance of personal knowledge by providing
the user with facilities for aggregating, browsing, editing, and refining associative
representations. The semantics of associative representation itself is left open.
Associative representation permits raw information materials to be accumulated
with minimal overhead. In return, interpretation of such information relies heavily
on background knowledge.

3.1 Associative Representation 

In the following, we hypothesize that connecting information without defining
the semantics, using the associative representation, is effective for handling a
large amount of heterogeneous information. An associative representation is a
many-to-many hyperlink associating one or more key units with one or more value
units. In our approach, the semantics of the associations is not defined strictly.
Instead, we leave the interpretation of the semantics to human tacit background
knowledge. This facilitates the acquisition of information from a variety of data
(e.g. ideas, texts, images). Examples of associative representation are illustrated in
Figure 3.




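[Figure: panel labels recoverable from the source are (c) multimedia and (d) story, structure; panel (d) shows the parts of a letter: Heading, Introduction, Salutation, Body, Complimentary closing, Signature, Postscript.]

Fig. 3. Example of associative representation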
- Free Reminding

Figure 3(a) denotes that, from the given concept “Nara”, one may be reminded
of “Nara Park”, “Todai-ji Temple”, and “School Excursion”. Ten persons will
have ten different free remindings.

- Attribute-Value

Figure 3(b) denotes that “Todai-ji Temple” is reminded when “Nara” and
“Temple” are given as keys. Associations in this kind of collection form an
attribute-value representation.

- Multimedia

Reminded things are not only concepts but also images or texts. Figure 3(c)
denotes that, from the given concept “Kofuku-ji Temple”, one may be reminded
of “Picture of Kofuku-ji Temple” and “Explanation of Kofuku-ji Temple”.

- Story, Structure

Associations can also show a workflow or a story line. Figure 3(d) denotes
the form of a letter.



We call a set of associations collected from a particular point of view a
workspace. Any data item (e.g., a concept, a text, or an image file) is
represented on a workspace by an icon called a unit. Workspaces can be nested:
a workspace, presented as an icon, can be placed in another workspace.

Users externalize their own knowledge using associative representation and
store it as personal knowledge on the workspace. An example of personal
knowledge about fishing is shown in Figure 4. The user can add, delete, and
edit units and reorganize their associations with the mouse. From the user's
point of view, personal knowledge is accessed through the workspace.




Fig. 4. Example of personal knowledge on workspace

3.2 Experiment - Generating and Understanding Associative Representation -

In this section, we report how people generate associative representations, and 
how people understand their semantics. 



Method 

Apparatus 

CoMeMo^ 

Subjects 

One Ph.D. student, four 2nd-year M.Eng. students, and one staff member in our
laboratory.

Procedure 

(1) The subjects were trained in how to generate associations (students only).

(2) They freely generated associations that the keyword “agent” reminded
them of, using CoMeMo (students only).

(3) They were shown associations generated by other subjects and answered
the following questions (students only):

— “Do you understand what is written?”

— “Can you identify who wrote this associative representation?”

— “If you can identify who, why?”

— “Say anything you felt in this experiment.”

(4) The same as (3) (the staff subject only).



Results and Discussion An example of an associative representation generated
by Subject A is shown in Figure 5^.

All subjects generated associations within 30 minutes. We conclude that adults
who have computing skills can generate associations without difficulty.

All subjects understood the meaning of associations generated by others.
All other subjects identified the associative representation generated by
Subject C as his. We conclude that ideas can be transmitted using associations
among people who share knowledge. All subjects except Subject C laughed
when they saw the screens written by Subject C. 80% (4 out of 5 student
subjects) said that they had some fun during the experiment. The staff subject
made comments such as “I can assume the subject’s knowledge level concerning
research topics”, “I may want to ask this subject for a report”, and “I want
to talk to this subject, because he/she may be interesting”. We think that
transmitting ideas using associations among members of a group helps people
get to know each other and therefore facilitates human communication.

^ CoMeMo is a system for integrating heterogeneous information using associative
representation [2].

^ Concepts were originally written in Japanese.

4 Visualizing Community Knowledge Interaction Using Talking-Alter-Egos Metaphor

CoMeMo-Community visualizes community knowledge interaction for users by
means of the talking-alter-egos metaphor. An alter-ego behaves on behalf of the
user at the conversation place. By observing this behavior, the user learns
many things about the community.



4.1 Talking-Alter-Egos Metaphor

The talking-alter-egos metaphor consists of two components. One is an alter-ego
that keeps the externalized knowledge of a person. The other is a conversation
place where alter-egos make utterances in turn.

An example of the utterance mechanism is shown in Figure 6. At the beginning,
the user must put one or more keywords on the conversation place as topics of
utterance and make them active. Here, “Nara” and “Temple” are given as keywords
and activated (Figure 6(a)). Each alter-ego monitors the keywords on the
conversation place to see whether its knowledge contains information related
to them. If there is related information, the alter-ego links it as a new
keyword by copying it from its knowledge. The new keyword is added to the
right of the original keywords using associative representation. Alter-ego-A
finds the new keyword “Todaiji-temple” and links it with “Nara” and “Temple”
(Figure 6(b)). Then the original keywords “Nara” and “Temple” are deactivated,
and the new keyword “Todaiji-temple” is activated. Each alter-ego in turn puts
information related to “Todaiji-temple” on the conversation place
(Figure 6(c)). Alter-ego-A puts the new keywords “Nandai-mon Gate” and
“Daibutsu”, which are structures of Todaiji-temple. Alter-ego-B puts the new
keyword “Shuni-e Ceremony”, which is an event at Todaiji-temple. Such
activation cycles continue. Thus, alter-egos collaborate with each other to
generate a story by alternately reproducing information fragments from their
knowledge.

[Figure 6, recoverable panel label: (a) First topics “Nara”, “Temple”]

Fig. 6. Mechanism of talking-alter-egos metaphor
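
The activation cycle described in Section 4.1 can be illustrated with a small
sketch. This is not the CoMeMo-Community implementation, which the text does
not specify; it is a minimal model under stated assumptions. The function name
`converse`, the knowledge format (a list of key-set/value-list pairs per
alter-ego), the rule that an association fires only when all of its keys are
active, and the `max_cycles` limit are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical knowledge format: one association links a set of key units
# to a list of value units (a many-to-many link).
Association = Tuple[FrozenSet[str], List[str]]
Utterance = Tuple[str, FrozenSet[str], str]


def converse(alter_egos: Dict[str, List[Association]],
             topics: Set[str],
             max_cycles: int = 10) -> List[Utterance]:
    """Run activation cycles and return the utterances in order.

    Each utterance is (alter-ego name, keys it reacted to, new keyword).
    """
    active = set(topics)   # keywords that are currently activated
    seen = set(topics)     # every keyword already on the conversation place
    story: List[Utterance] = []
    for _ in range(max_cycles):
        new_keywords: List[str] = []
        # Alter-egos make utterances in turn (here: insertion order).
        for name, knowledge in alter_egos.items():
            for keys, values in knowledge:
                # Assumption: an association fires when all keys are active.
                if keys <= active:
                    for value in values:
                        if value not in seen:
                            seen.add(value)
                            new_keywords.append(value)
                            story.append((name, keys, value))
        if not new_keywords:
            break          # no alter-ego has related information left
        # Deactivate the old keywords and activate the new ones.
        active = set(new_keywords)
    return story
```

With the knowledge of Alter-ego-A and Alter-ego-B from the example above and
the topics “Nara” and “Temple”, this sketch produces “Todaiji-temple” in the
first cycle and “Nandai-mon Gate”, “Daibutsu”, and “Shuni-e Ceremony” in the
second.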

4.2 Story Generation by Talking-Alter-Egos Metaphor

An example of actual story generation on CoMeMo-Community is shown in Figure 7.

When a user enters a virtual community, all alter-egos registered in it first
appear on the conversation place. Each alter-ego is represented by the image of
the corresponding community member. There are two ways of choosing alter-egos.
One is for the user to select them from a list of registered alter-egos and/or
to point to their images directly (Figure 7(a)). The other is for the system to
choose the alter-egos that have information related to a given keyword.

In this case, three alter-egos are chosen and put on the left of the
conversation place. The user puts a keyword as the first topic. Each alter-ego
reacts by throwing back its own knowledge about the given keyword. As
interaction among alter-egos continues, the story grows gradually. The original
keyword was “Alcohol Drinks” (Figure 7(b)), which evokes “Wine” (Figure 7(c)),
which in turn evokes “Pasta” and “Italian Food”, resulting in “Word-of-Mouth
information” (Figure 7(d)).

4.3 Analysis of the Conversation among Alter-Egos 

Virtual conversation by alter-egos generates various stories. Story generation
is influenced by a variety of factors, such as the choice of alter-egos, the
given keywords, the maximum number of keywords that an alter-ego can put on
the conversation place, and so on.

Although the mechanism of this virtual conversation is very simple, many
interesting phenomena are observed. Such conversation among alter-egos looks
like conversation among humans.

We show interesting conversations among alter-egos in the following examples.

A typical example is shown in Figure 8, in which alter-egos representing a
staff member and three students converse about a laboratory project. The
alter-ego of the staff member talks about the main theme and the research of
the project. Each student's alter-ego talks about his or her system in detail.
This conversation is a typical type of knowledge sharing.

Figure 9 shows an example of topic shift during conversation. At first, each
alter-ego talks about “international conference”, which is given as the first
topic. But one of them begins to talk about “ruins”, with the topic “Mexico”
as the turning point.

Figure 10 shows an example of topic shift by a homonym. At the beginning of
the conversation, the lower two alter-egos talk about “fishing”, which is given
as the first topic. As soon as the keyword “Home Ground” is put on the
conversation place, the alter-egos begin to talk about “baseball field” and
“fishing spot”, respectively. Such phenomena also happen in real conversation.

Fig. 8. Example of conversation







Fig. 9. Example of topic shift 

4.4 Experiment - Reaction to Talking-Alter-Egos Metaphor - 

In this section, we report how people react to the talking-alter-egos metaphor. 

Method 

Apparatus 

CoMeMo-Community

Subjects 

Four 2nd-year M.Eng. students and one 1st-year M.Eng. student in our
laboratory.

Preparation 

We created 14 alter-egos on CoMeMo-Community that represent a staff member and
students in our laboratory.

Fig. 10. Example of topic shift by a homonym



Procedure 

(1) The subjects chose alter-egos that interested them and put keywords on the
conversation place.

(2) They were shown a demonstration of CoMeMo-Community using the alter-egos
and keywords they had chosen, and answered the following questions:

— “Could you understand the information displayed on the conversation
place?”^

— “What did you find out from this demonstration?”

— “What do you think about your information being opened to the public
with your image?”

— “Please feel free to comment on this system.”



Results and Discussion The reactions were all favorable.

All subjects were able to understand what is meant by associative
representation. However, as the number of associative representations
increased, most of them found it harder to understand, because they were
confused by overlapping units and links. We conclude that adults who initially
know nothing about associative representation can mostly understand it. We also
think that sharing background knowledge is necessary for a fuller comprehension
of associative representation.

All subjects said that it is easy to find common topics and interests in this
system because information interaction is shown visually and objectively. This
suggests that the system is effective for discovering information that helps
one make contact with others in a community.

One subject said, “Why does he have such information? I’d like to talk with
him personally.” Thus this system can facilitate communication not only in the
virtual community but also in the real world.

^ None of the subjects was knowledgeable about associative representation.

All subjects said that they would publish their own knowledge if it contributed
to their community, and that the use of the facial image is effective in the
demonstration, in particular when the observer is acquainted with the person
each alter-ego represents.

As a result of this investigation, we found that our system helps community
members communicate by sharing each member’s information and enhancing
awareness of the community.



5 Conclusion 

We proposed the associative representation to integrate heterogeneous
information such as static information (e.g., local site information) and
dynamic information created in word-of-mouth communication. The associative
representation connects various pieces of information without defining the
semantics strictly. We investigated how people generate and understand the
associative representation. We found that ideas can be transmitted using
associative representation among people who share knowledge.

We developed a system called CoMeMo-Community, which supports knowledge
evolution in a community. This system visualizes community knowledge
interaction and facilitates community awareness using the talking-alter-egos
metaphor. We investigated the effectiveness of this system in our laboratory.
The results suggest that the system helps community members communicate by
sharing each member’s information.

In future research, we plan to apply this system in practice to a larger
community over a network and evaluate its usefulness.

Abstract. Waka is a form of traditional Japanese poetry with a 1300-year
history. In this paper, we attempt to discover characteristics common to a
collection of waka poems. As a formalism for characteristics, we use regular
patterns where the constant parts are limited to sequences of auxiliary verbs
and postpositional particles. We call such patterns FUSHI. The problem is to
find automatically significant FUSHI patterns that characterize the poems.

Solving this problem requires a reliable significance measure for the
patterns. Brazma et al. (1996) proposed such a measure according to the MDL
principle. Using this method, we report successful results in finding patterns
from five anthologies. Some of the results are quite stimulating, and we hope
that they will lead to new discoveries. Based on our experience, we also
propose a pattern-based text data mining system. Further research into waka
poetry is now proceeding using this system.



1 Introduction 

Waka is a form of traditional Japanese poetry with a 1300-year history. Most
WAKA poems are in the form of TANKA, namely, they have five lines and
thirty-one syllables, arranged thus: 5-7-5-7-7. This poetry was usually composed in
momentary flashes of inspiration. Most frequently, it was used as a subtle means 
of communication between lovers and friends, and was therefore an important 
part of daily life in ancient Japan. Waka poetry has been central to the history of 
Japanese literature, and has been studied extensively by many scholars. Recently, 
since an accumulation of about 450,000 WAKA poems became available in a 
machine-readable form, it is expected that computers will play an important 
role in research into WAKA poetry. 

In this paper, we focus on the problem of discovering characteristics common 
to a collection of WAKA poems. As a formalism for characteristics, we use the 
class of pattern languages introduced by D. Angluin [2]. One of the most impor- 
tant subclasses is the class of regular pattern languages, in which each variable 
symbol appears only once. This subclass is sufficiently rich from a practical view- 
point [9]. To characterize WAKA poems, we limit the constant parts of regular 



S. Arikawa and H. Motoda (Eds.): DS’98, LNAI 1532, pp. 129-141, 1998.
(c) Springer-Verlag Berlin Heidelberg 1998



130 Mayumi Yamasaki et al. 



patterns to sequences of adjuncts, i.e., auxiliary verbs and postpositional
particles. For example, we consider patterns such as *BA*ZARAMASHIO*, where BA
is a postpositional particle, and ZARAMASHIO is a chain of two auxiliary verbs
ZARA and MASHI and a postpositional particle O. This pattern corresponds to
the subjunctive mood. We call such patterns FUSHI, and call their constant
parts adjunct sequences. This limitation of the constant parts to adjunct
sequences is essential in our characterization. It should be noted that a
Japanese sentence is a sequence of segments, each of which consists of a word
and its subsequent adjuncts. The FUSHI pattern is a reliable model for
techniques used in composing WAKA poems, and is closely related to their
structure and rhythm.

The goal of this paper is to discover, more or less automatically, significant 
FUSHI patterns that characterize a given set of WAKA poems, and then report 
some features that may be due to the times or to the poets’ personalities. The 
difficulties are summarized as follows. 

— To identify adjuncts appearing in WAKA poems. 

— To give an appropriate definition of the significance of FUSHI patterns. 

Since we have no delimiters between segments in the Japanese language, the
first item requires morphological analysis of WAKA poems. However, many
ambiguities, which are not easy to resolve, will arise during this analysis.
In this paper, we assume that any substring identical to an adjunct is the
adjunct, and therefore we have only to perform pattern matching. This
assumption simplifies the discussion: Let $\Sigma$ be an alphabet, and let
$* \notin \Sigma$ be the gap symbol. A pattern is a nonempty string over
$\Sigma \cup \{*\}$. We say that a pattern $\pi$ matches a string $w$ in
$\Sigma^+$ if $w$ is obtained by substituting strings in $\Sigma^*$ for the
occurrences of $*$ in $\pi$, respectively. A set $\Pi$ of patterns is said to
be a covering of a set $S$ of strings in $\Sigma^+$ if, for any string in $S$,
at least one pattern in $\Pi$ exists that matches it.
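
The matching and covering definitions above can be sketched in a few lines.
This is only an illustration under assumptions: strings are taken to be
romanized text that never contains the character `*`, and the function names
`pattern_to_regex`, `matches`, and `is_covering` are hypothetical, not taken
from the paper. Each gap is translated to the regular expression `.*`, so a
gap may be replaced by the empty string, as the definition allows.

```python
import re
from typing import Iterable

GAP = "*"  # the gap symbol; assumed not to occur in the strings themselves


def pattern_to_regex(pattern: str):
    """Translate a pattern over the alphabet plus '*' into a compiled regex."""
    if not pattern:
        raise ValueError("a pattern must be a nonempty string")
    # Each gap becomes '.*'; every other symbol is matched literally.
    parts = [".*" if ch == GAP else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern: str, w: str) -> bool:
    """True if w results from substituting a string for each gap in pattern."""
    return pattern_to_regex(pattern).fullmatch(w) is not None


def is_covering(patterns: Iterable[str], strings: Iterable[str]) -> bool:
    """True if every string is matched by at least one of the patterns."""
    pattern_list = list(patterns)
    return all(any(matches(p, w) for p in pattern_list) for w in strings)
```

For instance, `matches("*BA*KERE*", "MIREBAKANASHIKERE")` holds because the
three gaps can be filled with `MIRE`, `KANASHI`, and the empty string.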

Let $C \subseteq \Sigma^+$ be a set of adjunct sequences. A FUSHI pattern is a
pattern of the form $*\alpha_1*\alpha_2*\cdots*\alpha_h*$, where $h \geq 1$ and
$\alpha_1, \alpha_2, \ldots, \alpha_h \in C$. Our problem is then defined as
follows.

Given a finite set $S$ of strings in $\Sigma^+$, find the most significant
covering $\Pi$ of $S$ that consists of FUSHI patterns.

Note that, in general, S has infinitely many coverings. If we have an appro- 
priate definition of the significance of such coverings, then we can determine the 
most significant covering according to the definition. However, the problem re- 
mains of defining the significance appropriately. One such definition was given by 
Arimura et al. [3]. A k-minimal multiple generalization (abbreviated to k-mmg)
of a set S of strings is defined as a minimally general set that is a covering
of S containing at most k patterns. They showed in [3] that k-mmg is optimal
from the viewpoint of inductive inference from positive data based on
identification in the limit [7]. However, we are faced with the following
difficulties: (a) we must give an integer k as an upper bound of the size of
coverings in advance; (b) a polynomial-time algorithm exists for finding a
k-mmg, but it is impractical in the sense that the time complexity will be very
large for a relatively large value of k; (c) since a set S of strings has more
than one k-mmg, we need a criterion to choose an appropriate one.

Another definition was proposed by Brazma et al. [5]. The most significant
covering of a set S is defined as the most probable collection of patterns
likely to be present in the strings in S, assuming some simple, probabilistic
model. This criterion is equivalent to Rissanen's Minimum Description Length
(MDL) principle [8]. According to the MDL principle, the most significant
covering is the one that minimizes the sum of the length (in bits) of the
patterns and the length (in bits) of the strings when encoded with the help of
the patterns. Since finding the optimal solution is NP-hard, they presented a
polynomial-time algorithm that approximates the optimal solution within a
logarithmic factor.

In this paper, we use the MDL principle to define the significance of FUSHI
patterns, and apply the method developed by Brazma et al. to the problem of
finding significant patterns from a set of WAKA poems. The main contributions
of this paper are summarized as follows.

1. A new schema is presented in which the constant parts of regular patterns 
are restricted to strings in a set C. Allowing C to be the set of adjunct 
sequences yields a reliable model for characterizing WAKA poems. 

2. A new grammatical scheme of Japanese language is given as the basis of the 
above characterization. This scheme is far from standard, but constitutes a 
simple and effective tool. 

3. Successful results of our experiment of finding patterns from five anthologies 
are reported, some of which are very suggestive. We hope that they will lead 
to new areas of research. The significance measure for patterns based on the 
MDL principle is proved to be useful. 

4. A text data mining system is proposed, which consists of a pattern matching 
part and a pattern discovery part. Using this system, further research into 
WAKA poetry is now proceeding. 

It should be emphasized that the FUSHI pattern of a WAKA poem is not con- 
clusively established even when determined by non-computer-based efforts. The 
determined pattern may vary according to the particular interests of scholars, 
and to the other poems used for comparison. Our purpose is to develop a method 
for finding a set of significant patterns, some of which may give a scholar clues 
for further investigation. Similar settings can be found in the field of data
mining. In other words, our goal is to develop a text data mining system to
support research into WAKA poetry.

Unlike data mining in relational databases, data mining in texts that are 
written in natural language requires preprocessing based on natural language 
processing techniques with some domain knowledge, and such techniques and 
knowledge are the key to success [1,6]. In our case, no such techniques are needed; 
the knowledge used here is merely the set of adjunct sequences, which are allowed 
to appear in FUSHI patterns as constant parts. It may be relevant to mention 
that the third author is a WAKA researcher, the fourth author is a linguist in
Japanese language, and the first and the second authors are researchers in
computer science.

2 Method

This section shows why the FUSHI pattern can be a reliable model for char- 
acterizing WAKA poems, and presents our grammatical scheme on which the 
characterization is based. 



2.1 Fushi Pattern as a Model of Characteristics 

Consider the problem of finding characteristics from a collection of WAKA poems. 
Most studies on this problem have been undertaken mainly from the viewpoint of 
preference for KAGO words. By KAGO, we mean the nouns, verbs and adjectives 
used in WAKA poems^. It might be considered that WAKA poems commonly
containing some KAGO words are on the same subject matter: in other words, 
such characterization corresponds to similarity of subject matter. For example, 
Ki no Tsurayuki (ca. 872-945), one of the greatest of the early court poets, 
composed many WAKA poems on the cherry blossom. It is, however, simplistic 
thinking to conclude that the poet had a preference for cherry blossoms. The 
reason for this is that court poets in those days were frequently given a theme 
when composing a poem. We must assume that the poets could not freely use 
favored words. 

What then should be considered as characteristics of WAKA poems? The 
three poems shown in Fig. 1 are very famous poems from the imperial anthology 
Shinkokinshu. These poems were arranged by the compilers in one section of 
the anthology, and are known as the three autumn evening poems (Sanseki NO 
UTA). All the poems express a scene of autumn evening. However, the reason 
why the poems are regarded as outstanding poems on autumn evening is that 
they all used the following techniques: 

1. Each poem has two parts: the first three lines and the remaining two lines. 

2. The first part ends with the auxiliary verb KERI. 

3. The second part ends with a noun. 

Such techniques are basically modeled by FUSHI patterns, regular patterns in 
which the constant parts are limited to adjunct sequences. 

Waka poetry can be compared to IKEBANA, the traditional Japanese flower 
arrangement. The art of IKEBANA relies on the choice and combination of both
materials and containers. Limitation on the choice of materials, that is, KAGO 
words, probably forced the poets to concentrate on the choice of containers, 
FUSHI patterns. The FUSHI pattern is thus a reliable model for characterizing 
WAKA poems. 

^ The term KAGO consists of two morphemes KA and GO, which mean ‘poem’ and
‘word’, respectively; therefore, it means words used in poetry rather than in prose.
However, here, we mean the nouns, verbs and adjectives used in WAKA poems.

#361 (Priest Jakuren)

    SABISHISA WA            One cannot ask loneliness
    SONO IRO TO SHI MO      How or where it starts.
    NAKARI KERI             On the cypress-mountain,
    MAKI TATSU YAMA NO      autumn evening.
    AKI NO YUGURE.

#362 (Priest Saigyo)

    KOKORO NAKI             A man without feelings,
    MI NI MO AWARE WA       Even, would know sadness
    SHIRA RE KERI           When snipe start from the marshes
    SHIGI TATSU SAWA NO     On an autumn evening.
    AKI NO YUGURE.

#363 (Fujiwara no Teika)

    MIWATASE BA             As far as the eye can see.
    HANA MO MOMIJI MO       No cherry blossom.
    NAKARI KERI             No crimson leaf:
    URA NO TOMAYA NO        A thatched hut by a lagoon,
    AKI NO YUGURE.          This autumn evening.

Fig. 1. The three autumn evening poems from Shinkokinshu; blank symbols
are placed between the words for readability. English translations are from [4].



2.2 Our Grammatical Scheme 

In the standard framework of Japanese grammar, words are divided into two 
categories: independent words (or simply, words) and dependent words (or ad- 
juncts). The former is a category of nouns, verbs, adjectives, adverbs, conjunc- 
tions and interjections, while the latter are auxiliary verbs and postpositional 
particles. A Japanese sentence is a sequence of segments, and each segment con- 
sists of a word and its subsequent adjuncts. Verbs, adjectives, and auxiliary verbs 
can be conjugated. It should be noted that most of the conjugated suffixes of 
verbs and adjectives are identical to some auxiliary verbs or to their conjugated 
suffixes. If we regard the conjugated suffixes of words as adjuncts, a segment 
can be viewed as a word stem and its subsequent adjuncts. This stem-adjunct 
scheme is far from being standard grammar, but it does constitute a simple and 
effective tool for our purposes. 

We can see that the FUSHI pattern *REBA*KOSO*KERE* is common to the poems in
Figure 2. The occurrences of BA and KOSO in these poems are postpositional
particles. However, the occurrences of RE and KERE belong to more than one
grammatical category in the standard grammar. In fact, the occurrence of RE in
each of the first two poems is a conjugated suffix of a verb, while RE in the
last poem is a conjugated suffix of an auxiliary verb. The occurrence of KERE
in each of the first two poems is a conjugated suffix of an adjective, while
KERE in the last poem is an auxiliary verb. Although the occurrences of the
FUSHI pattern in the three poems are thus different, we intend to treat them as
if they were the same. Finding FUSHI patterns requires a new grammatical scheme
that is different from the standard; our stem-adjunct scheme is appropriate. As
stated previously, the strings appearing as the constant parts of FUSHI
patterns are called adjunct sequences. Although an adjunct sequence consists of
one or more adjuncts, we treat it as an indivisible unit.

3 Optimal Covering Based on MDL Principle 

This section presents the definition of optimal covering based on the MDL princi- 
ple proposed by Brazma et al. [5] , and then shows an algorithm for approximating 
the optimal covering. 

3.1 Definition 

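Let us denote by $L(\pi)$ the set of strings a pattern $\pi$ matches. Consider
a pattern

$$\pi = *\beta_1*\beta_2*\cdots*\beta_h* \qquad (\beta_1, \ldots, \beta_h \in \Sigma^+)$$

and a set $B = \{\alpha_1, \ldots, \alpha_n\}$ of strings such that
$B \subseteq L(\pi)$. The set $B$ can be described by the pattern $\pi$ and the
strings

$$\begin{array}{cccc}
\gamma_{1,0} & \gamma_{1,1} & \cdots & \gamma_{1,h} \\
\gamma_{2,0} & \gamma_{2,1} & \cdots & \gamma_{2,h} \\
\vdots & \vdots & & \vdots \\
\gamma_{n,0} & \gamma_{n,1} & \cdots & \gamma_{n,h}
\end{array}$$

such that $\alpha_i = \gamma_{i,0}\beta_1\gamma_{i,1}\cdots\gamma_{i,h-1}\beta_h\gamma_{i,h}$
for $i = 1, \ldots, n$. Such a description of $B$ is called the encoding by
pattern $\pi$. We denote by $\|\alpha\|$ the description length of a string
$\alpha$ in some encoding. For simplicity, we ignore the delimiters between
strings. The description length of $B$ is

$$\|\pi\| + \sum_{i=1}^{n} \sum_{j=0}^{h} \|\gamma_{i,j}\|.$$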

Kokinshu #193 (Oe no Chisato)

    TSUKI MI RE BA / CHIJI NI MONO KOSO / KANASHI KERE /
    WAGA-MI HITOTSU NO / AKI NI WA ARA NE DO.

Gosenshu #739 (Daughter of Kanemochi no asoni)

    YUSA RE BA / WAGA-MI NOMI KOSO / KANASHI KERE /
    IZURE NO KATA NI / MAKURA SADAME M.

Shuishu #271 (Minamoto no Shitago)

    OI NU RE BA / ONAJI KOTO KOSO / SE RARE KERE /
    KIMI WA CHIYO MASE / KIMI WA CHIYO MASE.

Fig. 2. Waka poems containing pattern *REBA*KOSO*KERE*.

Let us denote by $c(\pi)$ the string obtained from $\pi$ by removing all $*$’s.
Assuming some symbolwise encoding, we have

$$\|\alpha_i\| = \sum_{j=0}^{h} \|\gamma_{i,j}\| + \|c(\pi)\|.$$

The description length of $B$ is then

$$\|\pi\| + \sum_{i=1}^{n} \left( \|\alpha_i\| - \|c(\pi)\| \right)
  = \sum_{i=1}^{n} \|\alpha_i\| - \left( \|c(\pi)\| \cdot |B| - \|\pi\| \right).$$

Let $A$ be a finite set of strings. A finite set
$\Omega = \{(\pi_1, B_1), \ldots, (\pi_k, B_k)\}$ of pairs of a pattern $\pi_i$
and a subset $B_i$ of $A$ is also said to be a covering of $A$ if:

- $B_i \subseteq L(\pi_i)$ $(i = 1, \ldots, k)$.

- $A = B_1 \cup \cdots \cup B_k$.

- $B_1, \ldots, B_k$ are disjoint.

When the set $B_i$ is encoded by $\pi_i$ for each $i = 1, \ldots, k$, the
description length of $A$ is

$$M(\Omega) = \sum_{i=1}^{n} \|\alpha_i\| - C(\Omega),$$

where $C(\Omega)$ is given by

$$C(\Omega) = \sum_{i=1}^{k} \left( \|c(\pi_i)\| \cdot |B_i| - \|\pi_i\| \right).$$

Now, the optimal covering of the set $A$ is defined to be the covering $\Omega$
minimizing $M(\Omega)$, or to be the set of patterns in it. Minimizing
$M(\Omega)$ is equivalent to maximizing $C(\Omega)$.
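
As a rough numerical illustration of $C(\Omega)$ and $M(\Omega)$, the sketch
below evaluates both for a covering given as (pattern, block) pairs. The
description-length function `dl` is supplied by the caller; the usage note
assumes a hypothetical symbolwise code of 5 bits per symbol, which is chosen
only for the example and is not the encoding used in the paper. The names
`constant_part`, `gain`, and `description_length` are likewise illustrative.

```python
from typing import Callable, List, Tuple

# A covering is a list of pairs (pattern, strings encoded by that pattern).
Covering = List[Tuple[str, List[str]]]


def constant_part(pattern: str) -> str:
    """c(pi): the pattern with every gap symbol '*' removed."""
    return pattern.replace("*", "")


def gain(covering: Covering, dl: Callable[[str], float]) -> float:
    """C(Omega) = sum over i of ( ||c(pi_i)|| * |B_i| - ||pi_i|| )."""
    return sum(dl(constant_part(p)) * len(block) - dl(p)
               for p, block in covering)


def description_length(covering: Covering,
                       dl: Callable[[str], float]) -> float:
    """M(Omega) = (sum of ||alpha|| over all strings) - C(Omega)."""
    plain = sum(dl(a) for _, block in covering for a in block)
    return plain - gain(covering, dl)
```

With `dl = lambda s: 5 * len(s)` and the covering
`[("*KERI*", ["NAKARIKERI", "SHIRAREKERI"])]`, the constant part `KERI` costs
20 bits and is shared by two strings, while the pattern itself costs 30 bits,
so the gain is 20 * 2 - 30 = 10 bits.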



3.2 Detail of Encoding 

In the above definition, the optimal covering varies depending on the encoding
method. In the coding scheme in [5], the patterns and the strings for
substitutions are coded together with delimiter symbols in some optimal
symbolwise coding with respect to a probability distribution. Therefore, the
formula of $C(\Omega)$ contains parameters that are the occurrence
probabilities of the delimiter symbols and the gap symbol $*$.

However, in our case the strings to be coded are of length less than $m = 32$
because we deal only with poems consisting of thirty-one syllables. So, we can
choose a simple way. We shall describe a string $w \in \Sigma^*$ as the pair of
the length of $w$ and the bit-string representing $w$ in some optimal coding
with respect to a probability distribution $P$ over $\Sigma$. In practice, we
can take $P(a)$ proportional to the relative frequency of a symbol
$a \in \Sigma$ in a database. We denote by $\ell_P(w)$ the length of the
bit-string representing $w$. We also denote by $n_*(\pi)$ the number of
occurrences of $*$ in a pattern $\pi$. Assuming a positive number $m$ such that
$|w| < m$, we have

$$C(\Omega) = \sum_{i=1}^{k} \left( u(\pi_i) \cdot |B_i| - v(\pi_i) \right),$$

where

$$u(\pi) = \ell_P(c(\pi)) - n_*(\pi) \log_2 m + \log_2 m,$$
$$v(\pi) = \ell_P(c(\pi)) + n_*(\pi) \log_2 m.$$
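
The quantities $u(\pi)$ and $v(\pi)$ can be computed directly once a
symbolwise code is fixed. The sketch below assumes the idealized code length
$-\log_2 P(a)$ bits for a symbol $a$, with $P(a)$ estimated as the relative
frequency of $a$ in a corpus; the text only says "some optimal coding", so
this choice is an assumption, as are the function names. Every symbol of a
pattern's constant part must occur in the corpus used for estimation.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def symbol_costs(corpus: Iterable[str]) -> Dict[str, float]:
    """Bits per symbol, -log2 P(a), with P(a) the relative frequency of a."""
    counts = Counter(ch for text in corpus for ch in text)
    total = sum(counts.values())
    return {ch: -math.log2(n / total) for ch, n in counts.items()}


def code_length(w: str, costs: Dict[str, float]) -> float:
    """l_P(w): bits needed for w under the symbolwise code."""
    return sum(costs[ch] for ch in w)


def u(pattern: str, costs: Dict[str, float], m: int = 32) -> float:
    """u(pi) = l_P(c(pi)) - n_*(pi) * log2(m) + log2(m)."""
    constant = pattern.replace("*", "")
    gaps = pattern.count("*")
    return code_length(constant, costs) - gaps * math.log2(m) + math.log2(m)


def v(pattern: str, costs: Dict[str, float], m: int = 32) -> float:
    """v(pi) = l_P(c(pi)) + n_*(pi) * log2(m)."""
    constant = pattern.replace("*", "")
    gaps = pattern.count("*")
    return code_length(constant, costs) + gaps * math.log2(m)
```

For a toy corpus in which the four symbols A, B, C, D are equally frequent,
each symbol costs 2 bits, so for the pattern `*AB*` with $m = 32$ we get
$u = 4 - 10 + 5 = -1$ and $v = 4 + 10 = 14$.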



3.3 Approximation Algorithm 



Since the problem of finding the optimal covering of a set of strings contains
the set covering problem as a special case, it is NP-hard. Brazma et al. [5]
modified the problem as below:



Given a finite set $A$ of strings and a finite set $\Delta$ of patterns, find a
covering $\Omega$ of $A$ in which patterns are chosen from $\Delta$ that
minimizes $M(\Omega)$.



They presented a greedy algorithm that approximates the optimal solution. It
computes the value of

$$u(\pi) - \frac{v(\pi)}{|L(\pi) \cap U|}$$

for all possible patterns $\pi$ at each iteration of a loop, and adds the
pattern maximizing it to the covering. Here $U$ is the set of strings in $A$
not covered by any pattern that has already been selected. The value of
$M(\Omega)$ for an approximate solution $\Omega$ obtained by this algorithm is
at most $\log_2 |A|$ times that of
the optimal one. The time complexity of the algorithm is
$O(|\Delta| \cdot |A| \cdot \log_2 |A|)$ when excluding the computation of
$\{(\pi, L(\pi) \cap A) \mid \pi \in \Delta\}$, which requires
$O(\|\Delta\| + |\Delta| \cdot \sum_{\alpha \in A} |\alpha|)$ time to perform
the pattern matching between the patterns in $\Delta$ and the strings in $A$.
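
The greedy selection can be sketched as follows. This is a simplified reading
of the procedure, not the implementation of Brazma et al.: the score is taken
to be $u(\pi) - v(\pi)/|L(\pi) \cap U|$, the functions `u` and `v` are passed
in by the caller, ties go to the earlier candidate, and the loop simply stops
when no candidate matches any remaining string (the stopping rule is an
assumption). The helper `matches` and the function name `greedy_covering` are
illustrative.

```python
import re
from typing import Callable, List, Sequence, Set, Tuple


def matches(pattern: str, w: str) -> bool:
    """True if w is obtained by filling each gap '*' with some string."""
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.fullmatch(regex, w, re.DOTALL) is not None


def greedy_covering(strings: Sequence[str],
                    candidates: Sequence[str],
                    u: Callable[[str], float],
                    v: Callable[[str], float]
                    ) -> Tuple[List[Tuple[str, Set[str]]], Set[str]]:
    """Return (covering, uncovered strings) chosen greedily."""
    # Pattern matching is done once: the set L(pi) ∩ A for each candidate.
    matched = {p: {w for w in strings if matches(p, w)} for p in candidates}
    uncovered = set(strings)                  # the set U
    covering: List[Tuple[str, Set[str]]] = []
    while uncovered:
        best, best_score = None, None
        for p in candidates:
            hit = matched[p] & uncovered      # L(pi) ∩ U
            if not hit:
                continue
            score = u(p) - v(p) / len(hit)
            if best_score is None or score > best_score:
                best, best_score = p, score
        if best is None:
            break                             # nothing matches what is left
        block = matched[best] & uncovered
        covering.append((best, block))
        uncovered -= block
    return covering, uncovered
```

Because each selected block is removed from the uncovered set, the blocks of
the returned covering are disjoint, as the definition of a covering requires.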



4 Finding Fushi Patterns from WAKA Poems 



This section describes our experiment of seeking FUSHI patterns in a collection
of WAKA poems. To apply the algorithm of Brazma et al. to our problem, we need
a way of identifying adjuncts with fewer misdetections and a formal definition
of adjunct sequences, which are presented in Sections 4.1 and 4.2. Successful
results of the experiment are then shown in Section 4.3.

4.1 How to Avoid Misdetection of Adjuncts

Since we identify a string that matches an adjunct with the adjunct, there can
be many misdetections. To avoid such misdetections, we adopted the following
techniques.

— We restricted ourselves to the adjunct sequences appearing at the end of lines 
of WAKA poems. Obviously, an adjunct sequence can appear in the middle of 
a line when the line contains more than one segment. However, most of the 
important adjunct sequences related to FUSHI patterns appear at the end of 
lines. 

— A WAKA poem was written in a mixture of Chinese characters and KANA
characters. The former are ideographic characters whereas the latter are
syllabic characters. An equivalent written in only KANA characters is attached
to every WAKA poem in our database. Suppose that a line of one poem is
equivalent to a line of another poem. If we use the one having the shorter KANA
string at the end of the line, then misdetection of adjuncts will decrease
because adjuncts are written in KANA characters. Based on this idea, we
replaced each line of the poem by its ‘canonical’ form.

Although we cannot avoid all misdetections of adjuncts, our system is ade- 
quate for finding FUSHI patterns. 

4.2 Definition of Adjunct Sequences 

In our setting, the class of FUSHI patterns is defined by giving a set C of adjunct
sequences. We therefore need a formal definition of adjunct sequences. For
the definition, we give grammatical rules about concatenation of adjuncts. An
adjunct sequence can be divided into three parts: first, a conjugated suffix of a
verb, adjective, or auxiliary verb; second, a sequence of auxiliary verbs; third, a
sequence of postpositional particles. Let us denote a conjugated suffix, an auxiliary
verb, and a postpositional particle by Suf, AX, and PP, respectively. There
are syntactic and semantic constraints in concatenation of Suf, AX, and PP. The
syntactic constraint is relatively simple, and is easy to describe. It depends on the 
combination of a word itself and the conjugated form of the preceding word. On 
the other hand, the semantic constraint is not so easy to describe completely. 
Here, we consider only the constraint between AX and AX, and between PP 
and PP. We classified the category of AX into five subcategories, and developed 
rules according to the classification. We also classified PP into six subcategories, 
and applied rules in a similar way. The set C of adjunct sequences was thus 
defined. 

4.3 Experimental Results for Five Anthologies 

We applied the algorithm to five anthologies: Kokinshu, Shinkokinshu, Minishu,
Shuiguso, and Sankashu. See Table 1. The first two are imperial anthologies,
i.e., anthologies compiled by imperial command, the first completed in 922,
and the second in 1205. The differences between the two anthologies, if any exist,
may be due to the time difference in compilation. On the other hand, the
others are private anthologies of poems composed by three contemporaries:
Fujiwara no Ietaka (1158-1237), Fujiwara no Teika (1162-1241), and the priest
Saigyo (1118-1190). Their differences probably depend on the poets' personalities.

Table 1. Five collections of WAKA poems.

Anthology      Explanation                                            # poems
Kokinshu       Imperial anthology compiled in 922                       1,111
Shinkokinshu   Imperial anthology compiled in 1205                      2,005
Minishu        Private anthology by Fujiwara no Ietaka (1158-1237)      3,201
Shuiguso       Private anthology by Fujiwara no Teika (1162-1241)       2,985
Sankashu       Private anthology by Priest Saigyo (1118-1190)           1,552

Table 2 shows the results of the experiments. A great number of patterns
occur in each anthology, and therefore it is impossible to examine all of them
manually. In the second column, the values in parentheses are the numbers of
patterns occurring more than once. In the experiment, we used these sets of
patterns as $\Delta$, the sets of candidate patterns. The size of each covering is shown
in the third column. For example, 191 of 8,265 patterns were extracted from
Kokinshu. The coverings are small enough that all the patterns
within them can be examined manually.



Table 2. Coverings of five anthologies.

Anthology      # occurring patterns (more than once)   Size of covering
Kokinshu       164,978  (8,265)                         191
Shinkokinshu   233,187  (12,449)                        270
Minishu        187,014  (16,425)                        369
Shuiguso       214,940  (14,365)                        335
Sankashu       279,904  (12,963)                        232



Table 3 shows the first five patterns emitted by the algorithm from Kokinshu.
The first pattern *KEREBA*BERANARI* contains the auxiliary verb BERANARI,
which is known to be used mainly in the period of Kokinshu. The fourth pattern
*RISEBA*RAMASHI* corresponds to the subjunctive mood. The last pattern
*WA*NARIKERI* corresponds to the expression "I have become aware of the fact
that ...". The remaining two patterns are different correlative word expressions,
called KAKARI-MUSUBI. The obtained patterns are thus closely related to
techniques used in composing poems.

Table 3. FUSHI patterns from Kokinshu.

Pattern               Annotation
*KEREBA*BERANARI*     use of auxiliary verb BERANARI
*ZO*SHIKARIKERU*      correlative word expression (KAKARI-MUSUBI)
*KOSO*RIKERE*         correlative word expression (KAKARI-MUSUBI)
*RISEBA*RAMASHI*      the subjunctive mood
*WA*NARIKERI*         the expression of awareness

Next, we shall compare pattern occurrences in the five anthologies. Table 4
shows the first five patterns for each anthology, where each numeral denotes
the frequency of the pattern in the anthology. The following facts, for
example, can be read from Table 4:

1. Pattern *BAKARI*RAM* does not occur in either Kokinshu or
Shinkokinshu.

2. Pattern *WA*NARIKERI* occurs in each of the anthologies. In particular, it
occurs frequently in Sankashu.

3. Pattern *MASHI*NARISEBA* occurs frequently in Sankashu.

4. Pattern *KOSO*RIKERE* does not occur in Shuiguso.

5. Pattern *YA*RURAM* occurs in each anthology except Kokinshu.

It is possible that the above facts are important characteristics that may be due
to the times or the poets' personalities. For example, (2) and (3) may reflect
Priest Saigyo's preferences, and (5) may imply that the pattern *YA*RURAM*
was not preferred in the period of Kokinshu. Comparisons of the obtained
patterns and their frequencies thus provide a WAKA researcher with clues for further
investigation.



Table 4. FUSHI patterns from five anthologies with frequencies, where A, B, C, D
and E denote Kokinshu, Shinkokinshu, Minishu, Shuiguso, and Sankashu,
respectively. The letter in the first column marks the anthology for which the
five patterns are listed.

     Patterns               A    B    C    D    E
A    *KEREBA*BERANARI*      5    0    0    0    0
     *ZO*SHIKARIKERU*       8    1    0    0    3
     *KOSO*RIKERE*         11    8    8    0   13
     *RISEBA*RAMASHI*       5    2    0    0    4
     *WA*NARIKERI*         20   26   26   11   52
B    *KARISEBA*MASHI*       3    6    0    0    1
     *NO*NIKERUKANA*        4   11    2    1    4
     *WA*NARIKERI*         20   26   26   11   52
     *KOSO*RIKERE*         11    8    8    0   13
     *MO*KARIKERI*          4   11    8    5    7
C    *BAKARI*RURAM*         0    0    6    0    3
     *KOSO*NARIKERE*        4    0    5    0    5
     *YA*NARURAM*           0    2   16    4    7
     *WA*NARIKERI*         20   26   26   11   52
     *NO*NARIKERI*         19   30   39   19   49
D    *BAKARI*RAM*           0    0   11    8    3
     *NO*NARIKERI*         19   30   39   19   49
     *RAZARIKI*NO*          0    0    1    6    1
     *YA*RURAM*             0    8   40   24   23
     *NI*NARURAM*           0    2    8    8    7
E    *MASHI*NARISEBA*       0    2    1    0   10
     *KOSO*KARIKERE*        4    4    1    0    8
     *NARABA*RAMASHI*       1    0    0    0    8
     *O*UNARIKERI*          1    0    0    0    7
     *NO*RUNARIKERI*        4    3    4    0   10

5 A Pattern-Based Text Data Mining System

Based on the experience of finding patterns from anthologies described in the 
previous section, we propose a text data mining system to support research into 
WAKA poetry. The system consists of two parts: the pattern discovery part and 
the pattern matching part. The pattern discovery part emits a set of patterns, 
some of which stimulate the user to form hypotheses. To verify the hypotheses, 
the user retrieves a set of poems containing the patterns by using the pattern 
matching part, examines the retrieved poems, and then updates the hypotheses. 
The updated hypotheses are then verified again. Repeating this process will 
provide results that are worthwhile to the user. 

In practice, a slightly modified pattern is often better than the original one emitted
by the pattern discovery part; that is, a slightly more general or more specific pattern
may be preferred. The user may therefore want to browse the 'neighbors' of a pattern. The
proposed system provides, as its GUI, a pattern browser for traversing the Hasse diagram
of the partial order on the set of patterns. We have implemented a prototype of
this system, and a new style of research into WAKA poetry utilizing the prototype
system is now proceeding.

Abstract. High-dimensional data, such as documents, digital images,
and audio clips, can be considered as spatial objects, which induce a metric
space where the metric can be used to measure dissimilarities between
objects. We propose a method for retrieving objects within some distance
from a given object by utilizing a spatial indexing/access method,
R-tree. Since R-tree usually assumes a Euclidean metric, we have to embed
objects into a Euclidean space. However, some naturally defined
distance measures, such as $L_1$ distance (or Manhattan distance), cannot
be embedded into any Euclidean space. First, we prove that objects in
discrete $L_1$ metric space can be embedded into vertices of a unit hyper-cube
when the square root of $L_1$ distance is used as the distance. To take
full advantage of R-tree spatial indexing, we have to project objects into
a space of relatively lower dimension. We adopt FastMap by Faloutsos and
Lin to reduce the dimension of object space. The range corresponding to
a query $(Q, h)$ for retrieving objects within distance $h$ from an object $Q$
is naturally considered as a hyper-sphere even after FastMap projection,
which is an orthogonal projection in Euclidean space. However, it
turns out that the query range is contracted into a hyper-box smaller
than the hyper-sphere by applying FastMap to objects embedded in the
above-mentioned way. Finally, we give a brief summary of experiments
in applying our method to Japanese chess boards.



1 Introduction 

Let $S$ be a finite set of objects $\{O_1, O_2, \ldots, O_m\}$ and $D : S \times S \to \mathbf{N}$ be a
function which gives the distance between objects. We call $(S, D)$ an object space.
A query is given as a pair $(Q, h)$ of an object $Q \in S$ and a natural number $h$.
The answer $\mathrm{Ans}(Q, h)$ to a query $(Q, h)$ is the set of objects within distance $h$
from $Q$, that is,

$$\mathrm{Ans}(Q, h) = \{O_i \in S \mid D(Q, O_i) \le h\}.$$

The above setting of approximate retrieval as $\mathrm{Ans}(Q, h)$ is very natural and
general. When $(S, D)$ is a Euclidean space, most spatial indexing structures can be
used almost directly to realize approximate retrieval. In many cases, however,
unless objects are inherently geometrical like map information, object space is
not Euclidean.

In this paper, we assume $(S, D)$ can be considered as a metric space based
on discrete $L_1$ (or Manhattan) distance, that is,

$$S \subset \mathbf{N}^n \quad \text{and} \quad D(O_i, O_j) = \sum_{k=1}^{n} \left| O_i^{(k)} - O_j^{(k)} \right|,$$

where $O_i^{(k)}$ and $O_j^{(k)}$ are the $k$-th coordinates of objects $O_i$ and $O_j$, respectively.
Most difference measures might be captured as a discrete $L_1$ distance. For
example, a natural definition of distance between objects consisting of several
attribute values may be the sum of the symmetric differences between the
attribute values. This definition can be applied to many sorts of objects, such as
documents, digital images, and game boards.
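The query setting can be illustrated with a minimal sketch of retrieval without any index: a linear scan that evaluates the discrete $L_1$ distance against every object. The names `l1_distance` and `ans` and the tuple representation of objects are assumptions made for illustration only; they do not come from the paper.

```python
from typing import List, Sequence

Obj = Sequence[int]  # hypothetical representation: an object is a tuple of naturals


def l1_distance(x: Obj, y: Obj) -> int:
    """Discrete L1 (Manhattan) distance between two objects of equal length."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(x, y))


def ans(objects: Sequence[Obj], q: Obj, h: int) -> List[Obj]:
    """Brute-force Ans(Q, h): every object within L1 distance h of q."""
    return [o for o in objects if l1_distance(q, o) <= h]


# Four points in the plane, queried around the origin.
S = [(0, 0), (0, 1), (0, 2), (1, 1)]
near_origin = ans(S, (0, 0), 1)
```

A scan of this kind corresponds to retrieval without indexing; the method described in this paper aims to avoid touching every object.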

We adopt R-tree [1,5] as a spatial indexing/access method. As Otterman
pointed out [7], R-tree can be used efficiently only for relatively low-dimensional
objects. Therefore, we have to map high-dimensional objects into a subspace
of lower dimension. We can use the FastMap method by Faloutsos and Lin [4]
to project objects in Euclidean space into its subspace. Since FastMap is based
on orthogonal projection in Euclidean space, we have to embed objects into a
Euclidean space. However, $L_1$ distance cannot be embedded into any Euclidean
space, in general. As we will see in Section 2, if we take the square root of $L_1$
distance as the distance, the objects can be embedded into a Euclidean space.
In other words, if we define

$$D^{1/2}(X, Y) = \sqrt{D(X, Y)},$$

$(S, D^{1/2})$ can be embedded into a Euclidean space. If we appropriately map objects
to vertices of a unit $n_0$-cube, then the Euclidean distance between vertices
coincides with the square root of the $L_1$ distance between objects.

Here, we briefly explain the FastMap method. Consider a set of objects
$\{O_1, O_2, \ldots, O_m\}$ in a Euclidean space, where $d(O_i, O_j)$ gives the Euclidean
distance between objects $O_i$ and $O_j$. Take an arbitrary pair $(O_a, O_b)$ of objects,
which is called a pivot. The first coordinate $X_i$ of an object $O_i$ is given by

$$X_i = O_aE = \frac{(d(O_a, O_i))^2 + (d(O_a, O_b))^2 - (d(O_b, O_i))^2}{2\,d(O_a, O_b)},$$

where $E$ is the image of $O_i$ by the orthogonal projection to the straight line $O_aO_b$
(Figure 1). Here, we should note that the distances between objects are enough to
calculate the coordinate $X_i$; the coordinates of the objects themselves are not necessary.
Let $O_i'$ be the image of $O_i$ by the orthogonal projection to the hyper-plane that
is orthogonal to the straight line $O_aO_b$. The distance between $O_i'$ and $O_j'$ is given
by

$$(d(O_i', O_j'))^2 = (d(O_i, O_j))^2 - (X_i - X_j)^2.$$

Thus, we can repeatedly apply the above projection to get the second and other
coordinates of objects. One of the most important issues in applying FastMap
may be how to select pivots. Intuitively, a better pivot should provide more
selectivity in retrieval. Details are discussed in [4].
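The projection step can be sketched as follows, using only a distance function and no object coordinates. This is an illustrative sketch rather than the implementation of [4]: the function name `fastmap`, the explicit list of pivot index pairs, and the clamping of tiny negative residuals to zero are all assumptions, and pivot selection is left to the caller.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def fastmap(objects: Sequence[T],
            dist: Callable[[T, T], float],
            pivots: Sequence[Tuple[int, int]]) -> List[List[float]]:
    """Map each object to len(pivots) coordinates using distances only.

    pivots[k] = (a, b) gives the indices of the pivot pair for coordinate k.
    """
    m = len(objects)
    coords = [[0.0] * len(pivots) for _ in range(m)]

    def residual_sq(i: int, j: int, level: int) -> float:
        # Squared distance left after removing the first `level` coordinates.
        value = dist(objects[i], objects[j]) ** 2
        for c in range(level):
            value -= (coords[i][c] - coords[j][c]) ** 2
        return max(value, 0.0)  # guard against floating-point round-off

    for level, (a, b) in enumerate(pivots):
        d_ab_sq = residual_sq(a, b, level)
        if d_ab_sq == 0.0:
            raise ValueError("pivot objects coincide in the residual space")
        d_ab = math.sqrt(d_ab_sq)
        for i in range(m):
            # Cosine-law formula for the coordinate along the pivot line.
            coords[i][level] = (residual_sq(a, i, level) + d_ab_sq
                                - residual_sq(b, i, level)) / (2.0 * d_ab)
    return coords


# Three points in the plane; two projection steps recover them exactly.
points = [(0.0, 0.0), (4.0, 0.0), (1.0, 2.0)]
coords = fastmap(points, math.dist, [(0, 1), (0, 2)])
```

In this small example the first pivot lies along the x-axis and the second residual pivot along the y-axis, so the computed coordinates coincide with the original ones up to rounding.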



Fig. 1. Orthogonal projection to pivot line



Let $p$ be an orthogonal projection to $\mathbf{R}^{k_0}$ which is obtained by FastMap,
where $(S, D)$ is the original object space and $D^{1/2}$ is used as the distance function
in applying FastMap. We call $\mathbf{R}^{k_0}$ the index space. Since $p$ is an orthogonal
projection, the distance between images of objects in the index space is not larger
than the square root of the distance between objects, that is,
$d(p(O_i), p(O_j)) \le D^{1/2}(O_i, O_j)$. For a query $(Q, h)$, we have

$$\{O_i \in S \mid d(p(Q), p(O_i)) \le \sqrt{h}\} \supseteq \mathrm{Ans}(Q, h).$$

Therefore, we can retrieve all the necessary objects even after reducing dimension
by FastMap. Such a retrieval is easily realized by using a spatial access method
like R-tree. The result may include objects irrelevant to the
query, which is caused by the FastMap projection. To get the exact answer, screening
might be needed.

From the experiments with our method, we observed that the image of the query
range in the index space $\mathbf{R}^{k_0}$, which is naturally considered as a $k_0$-sphere with
radius $h^{1/2}$, is larger than needed to get all the necessary objects. Precisely, we can prove

$$\{O_i \in S \mid |p^{(k)}(Q) - p^{(k)}(O_i)| \le \lambda_k h \text{ for all } k = 1, \ldots, k_0\} \supseteq \mathrm{Ans}(Q, h),$$

where $p^{(k)}(O)$ is the $k$-th coordinate of the image of $O$ in the index space and $\lambda_k$ is
a constant which is usually much smaller than 1. Thus, a $k_0$-box query range,
which is smaller than the $k_0$-sphere, is enough to retrieve the correct answer. This
phenomenon, which is derived from the combination of our object embedding
into unit $n_0$-cube and FastMap, will be theoretically explained as the contraction
of query range by FastMap in Section 3.

2 Embedding Distance into Euclidean Space 

Theorem 1. For any object space $(S, D)$, $(S, D^{1/2})$ can be embedded into
Euclidean space.






Takeshi Shinohara et al. 



Proof. Without loss of generality, we assume that $S = \{O_1, \ldots, O_m\} \subset \mathbf{N}^n$. For
each $k = 1, \ldots, n$, we use a bit vector of length $b_k$ where
$b_k = \max\{O_i^{(k)} \mid i = 1, \ldots, m\}$ and map the $k$-th coordinate value $v$ to $u_{b_k}(v)$, which is
a bit vector such that the first $v$ bits are 1 and the other bits are 0. Here we identify
bit vectors and bit strings. For each object $O_i$, we map $O_i$ to a bit vector
$u(O_i) = u_{b_1}(O_i^{(1)}) \cdots u_{b_n}(O_i^{(n)})$. Clearly, for any $k \in \{1, \ldots, n\}$ and any
$v, v' \in \{0, \ldots, b_k\}$, $(d(u_{b_k}(v), u_{b_k}(v')))^2 = |v - v'|$. Therefore,
$d(u(O_i), u(O_j)) = D^{1/2}(O_i, O_j)$. Thus, we can embed $(S, D^{1/2})$ into a unit $n_0$-cube, where
$n_0 = b_1 + \cdots + b_n$. □



For example, consider points $O(0,0)$, $A(0,1)$, $B(0,2)$, and $C(1,1)$ in the $x$-$y$
plane as in Figure 2. Distances between these points based on the $L_1$ metric are
$D(O,A) = D(A,B) = D(A,C) = 1$ and $D(O,B) = D(O,C) = D(B,C) = 2$.
As long as $D$ is used as the metric, $A$ should be on the straight line $OB$, and $O$, $B$,
and $C$ should make a regular triangle no matter what embedding is used. The
height of the regular triangle $OBC$ is the square root of 3, which contradicts
$D(A,C) = 1$. Thus, there is no Euclidean space where these four points are
embedded keeping the metric as it is. The maximum values of the $x$ and $y$ coordinates
are 1 and 2, respectively. Therefore, we map each point to a bit vector of length
$n_0 = 1 + 2 = 3$. We can regard the first bit as representing whether the $x$ coordinate is
equal to 1, the second whether the $y$ coordinate is greater than or equal to 1, and the
third whether the $y$ coordinate is equal to 2. Clearly, the Euclidean distances between
bit vectors are equal to the square roots of the respective $L_1$ distances between points in the $x$-$y$ plane.
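The construction in the proof of Theorem 1 can be checked on the four points $(0,0)$, $(0,1)$, $(0,2)$ and $(1,1)$ with a short sketch. The helper names `unary` and `embed` are hypothetical; the code only assumes that each coordinate value $v$ with bound $b$ is written as $v$ ones followed by $b - v$ zeros.

```python
from itertools import combinations
from typing import List, Sequence


def unary(v: int, b: int) -> List[int]:
    """Bit vector of length b whose first v bits are 1 and the rest 0."""
    return [1] * v + [0] * (b - v)


def embed(objects: Sequence[Sequence[int]]) -> List[List[int]]:
    """Map objects in N^n to vertices of a unit cube of dimension b_1 + ... + b_n."""
    n = len(objects[0])
    bounds = [max(o[k] for o in objects) for k in range(n)]
    return [[bit for k in range(n) for bit in unary(o[k], bounds[k])]
            for o in objects]


points = [(0, 0), (0, 1), (0, 2), (1, 1)]  # O, A, B, C of the example
vertices = embed(points)

# Squared Euclidean distance between vertices equals the L1 distance
# between the original points, so the Euclidean distance is its square root.
for (p, u), (q, w) in combinations(zip(points, vertices), 2):
    l1 = sum(abs(a - b) for a, b in zip(p, q))
    squared_euclid = sum((a - b) ** 2 for a, b in zip(u, w))
    assert squared_euclid == l1
```

Because every bit is 0 or 1, the squared Euclidean distance simply counts differing bits, which is why it reproduces the $L_1$ distance exactly.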




Point      ($x = 1$, $y \ge 1$, $y = 2$)
O(0,0)     (0, 0, 0)
A(0,1)     (0, 1, 0)
B(0,2)     (0, 1, 1)
C(1,1)     (1, 1, 0)

Fig. 2. Embedding $L_1$ distance into unit $n_0$-cube



From Theorem 1, we can apply FastMap to $(S, D^{1/2})$, which is embedded in
Euclidean space of $n_0$ dimensions. Here we should note that only distances between
objects are sufficient for applying FastMap and actual values of coordinates
in the Euclidean space are not necessary. Thus, the dimension $n_0$ of the Euclidean
space, which may be much larger than that of the original object space, does not
matter when we use FastMap.

3 Contraction of Query Range by FastMap



From Theorem 1, we can assume that every object is a vertex of a unit $n_0$-cube
and the value of each coordinate is 0 or 1. Let $P_0$ be the vector between
two objects used as the first pivot for FastMap. Let $e$ be a unit vector of any
coordinate in $\mathbf{R}^{n_0}$. Then the length of the image of $e$ by orthogonal
projection to the first pivot is given by

$$\frac{|e \cdot P_0|}{|P_0|}.$$

Since every component of $P_0$ is either $-1$, 0, or 1, the inner product $e \cdot P_0$ is
also $-1$, 0, or 1. Therefore,

$$\frac{|e \cdot P_0|}{|P_0|} \le \frac{1}{|P_0|}.$$

Let us define $\lambda_1$ as the right side of the above inequality. Consider two objects
$O_1$ and $O_2$ such that $D(O_1, O_2) = h$ and a vector $v$ between $O_1$ and $O_2$.
Clearly, exactly $h$ components of $v$ are $-1$ or 1 and all the other components
are 0. Therefore, the length of the image of $v$ is less than or equal to $h\lambda_1$.
Since $|P_0|$ is usually larger than 1, $\lambda_1$ is relatively small. For the second and
other projections by FastMap, similar phenomena can be derived. In what follows,
to avoid a somewhat complicated discussion, we give only the result.

Let $P_k$ be the $(k+1)$-th pivot. Let $\Pi_0 = P_0$, and $\Pi_k$ be the image of $P_k$ by
orthogonal projection to the hyper-plane $H_{k-1}$, which is orthogonal to $\Pi_0, \Pi_1, \ldots,$
and $\Pi_{k-1}$. Define $\beta(k, l)$ and $\gamma(k, l)$ for each $k > l \ge 0$ by

$$\beta(k, l) = \frac{P_k \cdot \Pi_l}{|\Pi_l|^2}, \qquad \gamma(k, l) = \beta(k, l) - \sum_{i=l+1}^{k-1} \beta(k, i)\,\gamma(i, l).$$

Then, for each $k > 0$, $\Pi_k$ is given as

$$\Pi_k = P_k - \sum_{l=0}^{k-1} \gamma(k, l)\,P_l.$$

Finally, as for the length of the image of a unit vector $e$ by the $k$-th orthogonal
projection of FastMap, we have the following upper bound $\lambda_k$.

$$\frac{|e \cdot \Pi_k|}{|\Pi_k|} \le \frac{1 + \sum_{l=0}^{k-1} |\gamma(k, l)|}{|\Pi_k|}$$

Theorem 2. For any query $(Q, h)$ and any FastMap $p$,

$$\{O_i \in S \mid |p^{(k)}(Q) - p^{(k)}(O_i)| \le \lambda_k h \text{ for all } k = 1, \ldots, k_0\} \supseteq \mathrm{Ans}(Q, h).$$

The answer $\mathrm{Ans}(Q, h)$ to a query $(Q, h)$ is given as the set of objects within
an $n_0$-sphere whose center and radius are $Q$ and $h^{1/2}$, respectively, where $D^{1/2}$ is
used as a Euclidean distance. We call this sphere a query range. Since a mapping $p$
obtained by FastMap is an orthogonal projection into Euclidean space, the image
of a query range by $p$ is a $k_0$-sphere of the same radius $h^{1/2}$ in the index space. On
the other hand, Theorem 2 says that all the objects in $\mathrm{Ans}(Q, h)$ are projected
by $p$ into a $k_0$-box whose center and radius of the $k$-th coordinate are $p(Q)$
and $\lambda_k h$, respectively. Here we should note that the constant $\lambda_k$ is usually much
smaller than 1, and therefore, the $k_0$-box has a smaller volume than the $k_0$-sphere
for relatively small $h$. Let $\lambda_0 = \max\{\lambda_k \mid 1 \le k \le k_0\}$. Then, the volume $V_B$ of
the $k_0$-box is less than or equal to $(2\lambda_0 h)^{k_0}$. On the other hand, the volume $V_S$
of the $k_0$-sphere is $C_{k_0} h^{k_0/2}$, where $C_r$ is a constant determined by $r$ and $C_r > 1$
for any $r \le 12$. Therefore, $V_B < V_S$ whenever $2\lambda_0 h \le 1$ and $k_0 \le 12$. Although
this estimation is very rough, in many cases we may expect the contraction of
query range by FastMap, which is illustrated in Figure 3. Since the square root
of $h$ is used as the radius of the $k_0$-sphere query range while $\lambda_k h$ is used for the $k_0$-box,
a dimension as low as possible should be selected to get the most effect from the contraction
of query range by FastMap.
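The bound on the image of a unit vector and the box containment of Theorem 2 can be checked numerically on a tiny cube. The sketch below rests on several assumptions that are not stated in the paper: the coefficients are computed as Gram-Schmidt coefficients of each pivot against the earlier orthogonalised pivots, the $k$-th index coordinate is taken to be the projection onto the normalised orthogonalised pivot (up to a constant offset), and the names `contraction_bounds` and `coordinate` as well as the three pivot vectors are hypothetical.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def contraction_bounds(pivots: Sequence[Vector]) -> Tuple[List[List[float]], List[float]]:
    """Return the orthogonalised pivots and, for each, an upper bound on the
    length of the image of a coordinate unit vector (index k is 0-based)."""
    pis: List[List[float]] = []
    gammas: List[List[float]] = []
    lams: List[float] = []
    for k, p_k in enumerate(pivots):
        # Coefficients of p_k against the earlier orthogonalised pivots.
        beta = [dot(p_k, pis[l]) / dot(pis[l], pis[l]) for l in range(k)]
        # Coefficients of the same vector expressed in the original pivots.
        gamma = [0.0] * k
        for l in range(k):
            gamma[l] = beta[l] - sum(beta[i] * gammas[i][l] for i in range(l + 1, k))
        pi_k = [p_k[c] - sum(gamma[l] * pivots[l][c] for l in range(k))
                for c in range(len(p_k))]
        norm = math.sqrt(dot(pi_k, pi_k))
        if norm == 0.0:
            raise ValueError("pivot vectors must be linearly independent")
        pis.append(pi_k)
        gammas.append(gamma)
        lams.append((1.0 + sum(abs(g) for g in gamma)) / norm)
    return pis, lams


def coordinate(o: Vector, pi_k: Vector) -> float:
    """Projection of o onto the normalised orthogonalised pivot."""
    return dot(o, pi_k) / math.sqrt(dot(pi_k, pi_k))


# Hypothetical pivots: differences of vertices of the unit 4-cube.
pivots = [(1, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 1)]
pis, lams = contraction_bounds(pivots)

# Every pair of cube vertices stays inside the box of half-width lam * h.
vertices = list(product((0, 1), repeat=4))
box_holds = all(
    abs(coordinate(q, pi) - coordinate(o, pi))
    <= lam * sum(abs(a - b) for a, b in zip(q, o)) + 1e-12
    for q in vertices for o in vertices for pi, lam in zip(pis, lams)
)
```

With only four dimensions the bounds are not small, since the pivot vectors are short; the contraction described in the text relies on pivot vectors that are long compared with a unit vector.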



Contracted Query Range 




4 Experimental Results — Japanese Chess Boards 

In this section, we give a brief summary of the experiments in applying our
indexing method to the retrieval of Japanese chess (shogi) boards similar to a given
one from 40,412 boards drawn from 500 play records.

Shogi uses 40 pieces of 8 sorts, and 6 sorts of pieces have a reverse side. A shogi
board consists of $9 \times 9 = 81$ positions, each of which may be occupied by one
of $(8 + 6) \times 2 = 28$ sorts of pieces, and two sets of captured pieces, each of which is a
subset of the 38 (all but 2 Kings) pieces.

4.1 Distance between Boards and Embedding into Unit Hyper-cube

For each position, we define the difference between two boards $O_i$ and $O_j$ depending
on what pieces are on the position. When the two positions are the same,
that is, they have the same piece or both of them have no piece, the difference
is 0. When one has a piece and the other has none, the difference is 1. Otherwise,
they have different pieces and the difference is defined as 2. For captured pieces,
the difference is the sum of the symmetric differences of the numbers of pieces
of each sort. We define the distance $D(O_i, O_j)$ as the sum of differences for all
positions and captured pieces. Note that the largest possible distance between
boards is 80 because all 40 pieces should be put on some position or included in
captured pieces.

By using 28 bits for each of 81 positions and 38 bits for each of two sets of captured
pieces, we can put shogi boards in a unit hyper-cube
of $28 \times 81 + 38 \times 2 = 2{,}344$ dimensions, where the distance $D(O_i, O_j)$ is given
by the $L_1$ metric. Thus, we can regard shogi boards as an object space.
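The board distance described above can be sketched as follows. The board representation is hypothetical: a pair of a dictionary from positions to piece codes (owner included in the code) and a pair of counters for the two sets of captured pieces. Reading "symmetric difference of the numbers of pieces" as the absolute difference of the counts is also an assumption.

```python
from collections import Counter
from typing import Dict, Tuple

Square = Tuple[int, int]
Board = Tuple[Dict[Square, str], Tuple[Counter, Counter]]  # hypothetical layout


def board_distance(b1: Board, b2: Board) -> int:
    """Sum of per-position differences (0, 1 or 2) and captured-piece differences."""
    (pos1, cap1), (pos2, cap2) = b1, b2
    total = 0
    for square in set(pos1) | set(pos2):
        p, q = pos1.get(square), pos2.get(square)
        if p == q:
            continue                                   # same piece: difference 0
        total += 1 if (p is None or q is None) else 2  # empty vs. piece, or two pieces
    for side in (0, 1):
        kinds = set(cap1[side]) | set(cap2[side])
        total += sum(abs(cap1[side][k] - cap2[side][k]) for k in kinds)
    return total


# Piece codes such as "K+" (king of the first player) are made up for the example.
start = ({(5, 9): "K+", (7, 7): "P+"}, (Counter(), Counter()))
moved = ({(5, 9): "K+", (7, 6): "P+"}, (Counter(), Counter()))
swapped = ({(5, 9): "K+", (7, 7): "G+"}, (Counter(), Counter()))
captured = ({(5, 9): "K+"}, (Counter(), Counter({"P": 1})))
```

Under this reading, moving one piece, replacing one piece by another, and capturing one piece each change the distance by 2.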

4.2 FastMap Projection and R-Tree Spatial Indexing 

FastMap projection was applied to the 40,412 shogi boards to reduce the dimension
of boards and to utilize R-tree spatial indexing efficiently. The pivot for each
step of FastMap was selected by randomly choosing 500 candidates and selecting
the one that maximizes the variance of coordinate values. As for the dimension $k_0$
of the index space, we adopted 5, 7, and 10. We used off-line packed R-trees [6]
based on Hilbert space filling curves [2,3].

4.3 Effect of Contraction of Query Range by FastMap 

After projecting boards by FastMap for each $k_0 = 5$, 7, or 10, we measured
the maximum lengths of the image of a unit vector in the unit hyper-cube on each
coordinate in the index space, which are between 0.12 and 0.21. On the other hand,
theoretical bounds derived from Theorem 2 are between 0.12 and 0.25. The
gap between actual measurements and theoretical bounds seems to suggest that
there are no worst combinations within the current shogi boards. Since the maximum
possible distance between boards is 80, the lower bound of $\lambda_k$ is given by

$$\lambda_k \ge \frac{1}{\sqrt{80}} = 0.112,$$

for each $1 \le k \le k_0$. From this, we observe that query ranges are contracted into
a relatively small $k_0$-box by FastMap projection.



4.4 Approximate Retrieval of Boards 

Finally, we made experiments on retrieval of boards similar to a given one by
using the R-tree index. For boards given as the centers of queries, we randomly
selected 700 boards from the 40,412 boards. For the radius of queries we gave
$h = 0, 2, 4, 6, 8$. Retrievals in the case $h = 0$ are exact ones. The averages of elapsed
time for retrievals are summarized in Table 1, where the row "none" represents
the average time of retrieval without indexing. From Table 1, our indexing
is efficient for small radii. Among 5, 7, and 10, the best dimension $k_0$
of the index space is 7 for all radii.



Table 1. Elapsed time of retrieval

$k_0$    h = 0     h = 2     h = 4     h = 6     h = 8
5        0.0098    0.0501    0.2351    0.8642    1.8316
7        0.0097    0.0452    0.2326    0.8384    1.8104
10       0.0134    0.0682    0.2549    0.8720    1.8983
none     1.9803    2.3000    2.4301    2.4517    2.5608



As mentioned in Section 3, a lower dimension is desired for contraction of the
query range. For example, let $\lambda_0 = \max\{\lambda_k \mid 1 \le k \le k_0\} = 0.25$, which is consistent
with our experiments. The volume $V_B(k_0)$ of the $k_0$-cube with
radius $\lambda_0 h$ and the volume $V_S(k_0)$ of the $k_0$-sphere with radius $h^{1/2}$ are given by
$V_B(k_0) = (2\lambda_0 h)^{k_0}$ and $V_S(k_0) = C_{k_0} h^{k_0/2}$, respectively, where $C_5 = \frac{8\pi^2}{15} = 5.26$,
$C_7 = \frac{16\pi^3}{105} = 4.72$, and $C_{10} = \frac{\pi^5}{120} = 2.55$. These volumes are summarized in
Table 2. From this, $V_B$ is larger than $V_S$ for all cases of $h = 8$ and the case of $h = 6$
and $k_0 = 10$, which may in part explain the tradeoff between the dimension of
index space and elapsed time of retrieval.



Table 2. Hyper-box vs. Hyper-sphere as Query Range

         $V_B$                                   $V_S$
$k_0$    h = 2   h = 4   h = 6    h = 8      h = 2   h = 4   h = 6   h = 8
5        1       32      243      1024       29.8    168     464     952
7        1       128     2187     16384      53.4    604     2497    6835
10       1       1024    59049    1048576    81.6    2611    19829   83558
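The entries of Table 2 can be recomputed with a short sketch, assuming that $C_r$ is the volume of the unit ball in $r$ dimensions, $\pi^{r/2} / \Gamma(r/2 + 1)$, which matches the three constants quoted in the text. The function names are hypothetical. Exact constants give sphere volumes that differ slightly from the table, whose values appear to have been computed from the constants rounded to two decimals.

```python
import math


def unit_ball_volume(r: int) -> float:
    """Volume C_r of the unit ball in r dimensions."""
    return math.pi ** (r / 2) / math.gamma(r / 2 + 1)


def box_volume(k0: int, h: float, lam0: float = 0.25) -> float:
    """Volume of the k0-box with half-width lam0 * h in every coordinate."""
    return (2 * lam0 * h) ** k0


def sphere_volume(k0: int, h: float) -> float:
    """Volume of the k0-sphere with radius sqrt(h)."""
    return unit_ball_volume(k0) * h ** (k0 / 2)


for k0 in (5, 7, 10):
    for h in (2, 4, 6, 8):
        print(k0, h, box_volume(k0, h), round(sphere_volume(k0, h), 1))
```

The same function also shows where the constant drops below 1: it is still above 1 in 12 dimensions and below 1 in 13.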



The tradeoff may also be explained by the nature of R-trees. In other words,
a higher dimension of the index space gives a more precise image but more difficulty in
spatial indexing by R-tree.

5 Concluding Remarks

In this paper, we have proposed a method for approximate retrieval using a spatial
indexing/access method like R-tree, where dissimilarities between objects are
measured by $L_1$ distance. In Theorem 1, we proved that objects with $L_1$ distance
can be embedded into a Euclidean space preserving the square root of $L_1$
distance as distance. In Theorem 2, we pointed out that contraction of query
range by FastMap can be expected when our embedding is used. Although the
experiments on approximate retrieval of Japanese chess boards seem to suggest
that our method can be successfully applied to many other cases, we should run
experiments in other natural applications of our method to analyze its applicability.

As for future work, we should enlarge the applicability of our method. Although
we assumed that distance between objects is measured by discrete $L_1$
distance, the proposed embedding by taking the square root of distances might be
applicable to other distances, such as edit distance between strings. As for strings
with edit distance, we made experiments that reported no inconsistency in embedding
and FastMap projection. The restriction of discreteness might be relaxed.